The first Italian research assessment exercise: a bibliometric perspective

In December 2003, seventeen years after the first UK research assessment exercise, Italy started up its first-ever national research evaluation, with the aim to evaluate, using the peer review method, the excellence of the national research productio…

Authors: Massimo Franceschet, Antonio Costantini

Massimo Franceschet*
Department of Mathematics and Computer Science, University of Udine
Via delle Scienze 206, 33100 Udine, Italy
Phone: +39 0432 558754 / Fax: +39 0432 558499

Antonio Costantini
Department of Agriculture and Environmental Sciences, University of Udine
Via delle Scienze 208, 33100 Udine, Italy

Abstract

In December 2003, seventeen years after the first UK research assessment exercise, Italy started up its first-ever national research evaluation, with the aim of evaluating, using the peer review method, the excellence of the national research production. The evaluation involved 20 disciplinary areas, 102 research structures, 18,500 research products and 6,661 peer reviewers (1,465 from abroad); it had a direct cost of 3.55 million euros and lasted 18 months. The introduction of ratings based on ex post quality of output and not on ex ante respect for parameters and compliance is an important leap forward of the national research evaluation system toward meritocracy. From the bibliometric perspective, the national assessment offered the unprecedented opportunity to perform a large-scale comparison of peer review and bibliometric indicators for an important share of the Italian research production. The present investigation takes full advantage of this opportunity to test whether peer review judgements and (article and journal) bibliometric indicators are independent variables and, in the negative case, to measure the sign and strength of the association.
Outcomes allow us to advocate the use of bibliometric evaluation, suitably integrated with expert review, for the forthcoming national assessment exercises, with the goal of shifting from the assessment of research excellence to the evaluation of average research performance without a significant increase in expenses.

Keywords: Research assessment, Peer review, Bibliometrics.

* Corresponding author
Email address: massimo.franceschet@dimi.uniud.it (Massimo Franceschet)
URL: http://users.dimi.uniud.it/~massimo.franceschet/ (Massimo Franceschet)

Preprint submitted to Research Policy, October 20, 2018

1. Introduction

In December 2003 Italy started up its first-ever research assessment exercise, called Valutazione Triennale della Ricerca (VTR), with the aim of evaluating the excellence of research activities performed by universities and other research institutions under Ministry of Education, University, and Research funding. VTR covered the research of 20 disciplinary areas during the three-year period from 2001 to 2003. It involved the evaluation of 102 research structures including 77 universities, 12 public research agencies, and 13 private research agencies, which submitted 18,500 research products for evaluation. Peer-reviewing the submitted products involved 6,661 experts (1,465 from abroad), with a direct cost of 3.55 million euros and a duration of 18 months.

Evaluation activities in Italian universities traditionally favored a bureaucratic approach based on an ex ante check of the respect for input, processes, or compliance with provisions of the law (Minelli et al., 2008). The introduction of ratings based on the ex post quality of output and not on ex ante respect for parameters and compliance is an important cultural leap forward (Bleiklie, 1998; Neave, 1998).
Furthermore, the rankings comparing the peer review ratings obtained by the universities in the different disciplinary areas were posted on the Web (footnote 1). This apparently plain decision is in fact unprecedented in the setting of Italian evaluation systems in the state sector, including universities, which is characterized by a general lack of courage and a production of rankings that are based on criteria giving only a loose indication of merit (Minelli et al., 2008; Calzà and Garbisa, 1995).

The VTR evaluation is fully based on the peer review evaluation method: each submitted research product was assessed by a pool of experts who expressed a qualitative judgement that was then mapped to a quantitative categorical rating. Reale et al. (2007) show that the VTR exercise was carried out on the basis of assessment criteria proposed in the literature for peer-reviewing (rationality, reliability, impartiality, efficiency, effectiveness), controlling the presence and the relevance of bias of the peer judgements (prestige of institutions and reputation of scientists). Hence, we assume here that peer reviewers expressed a reliable judgement on the products submitted at VTR and that this rating reflects the intrinsic quality of the product.

Submitted products were autonomously selected by research institutions in the measure of at most one product every four researchers (universities) or every two researchers (research agencies), choosing among the entire production over a three-year period. In order to maximize peer rating, each structure selected the products deemed to be of highest quality. It turned out that, for areas in which journal publication is the routine, most of the submitted products are journal articles and most of these articles appear in journals indexed in databases of Thomson Reuters, formerly known as ISI (footnote 2).
For each article covered by Thomson Reuters, we have at our disposal an article citation rating, measuring the number of citations that the article received from other papers in the database, and a journal citation rating, evaluating the impact factor of the journal in which the article appears, which corresponds to the average number of recent citations received by papers published in the journal (Garfield and Sher, 1963). Furthermore, for relatively large publication sets, we may compute the recently proposed and highly celebrated Hirsch index, which attempts to assess both production and impact in a single figure (Hirsch, 2005; Ball, 2007).

Footnote 1: http://vtr2006.cineca.it
Footnote 2: At the moment, Thomson Reuters Web of Science, Elsevier Scopus, as well as Google Scholar are the main multi-disciplinary bibliometric data sources.

This opens the unprecedented opportunity to perform a large-scale comparison of peer review and bibliometric indicators for the Italian research system. This is the aim of the present contribution. More specifically, we pose the following research questions:

1. Are peer review judgements and (article and journal) bibliometric indicators independent variables?
2. If not, what is the strength of the association?
3. In particular, is the association between peer judgement and article citation rating significantly stronger than the association between peer judgement and journal citation rating?

Answering these questions is of crucial importance to evaluate the opportunity of using bibliometrics in the next research assessment exercises.

In Section 2 we concisely describe the VTR assessment exercise. In Section 3 we address the posed questions with a careful analysis comparing peer review and bibliometric indicators at both levels of research disciplines (Section 3.1) and research structures within disciplines (Section 3.2). Related work is amply surveyed in Section 4. Finally, in Section 5 we draw some conclusions.

2. An overview of VTR

VTR was managed by the Committee for the Evaluation of Research (CIVR) and was designed as an ex post assessment exercise based on peer review. Its plan can be summarized as follows.

CIVR divided the national research system into 20 scientific-disciplinary areas, including 6 interdisciplinary sectors, and set up an evaluation panel responsible for the assessment of each area. Panels were composed of high-level experts (panelists), whose number ranged from 5 to 17 according to the area size and disciplinary complexity. The exercise was then articulated in three phases, which were in the charge of research structures, panels and CIVR, respectively.

In the initial phase, research institutions submitted to panels a set of autonomously selected research products. Types of products admitted to submission are: journal articles, books, book chapters, proceedings of national and international conferences, patents, designs, performances, exhibitions, manufactures and works of art. The only mandatory principle of selection stated that products of research should not exceed 50% of the full-time-equivalent researchers in the institution (footnote 3). The research structures submitted an overall sample of 18,500 products partitioned as follows: journal articles 72%, books 17%, book chapters 6%, patents 2% and the remaining typologies 3%. Evaluated products were more than 17,300 (there are products submitted by more than one institution). Research structures were also asked to transmit to CIVR data and indicators about human resources, international mobility of researchers, funding for research projects, patents, spin-offs and partnerships, which make it possible to reveal the impact on employment.
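The submission cap can be made concrete with a little arithmetic. The sketch below is only an illustration: the 50% ceiling and the weights (a university researcher counts as half a full-time-equivalent researcher, an agency researcher as one) come from the paper, while the function name, the dictionary keys and the choice to round down are assumptions made here.

```python
import math

# Full-time-equivalent (FTE) weight per researcher, as stated in the paper:
# university researchers also teach, so each counts as 0.5 FTE.
FTE_WEIGHT = {"university": 0.5, "agency": 1.0}

# Products may not exceed 50% of the FTE researchers in the institution.
MAX_SHARE_OF_FTE = 0.5


def max_products(researchers, kind):
    """Upper bound on submitted products (rounding down is an assumption)."""
    fte = researchers * FTE_WEIGHT[kind]
    return math.floor(fte * MAX_SHARE_OF_FTE)
```

Under these assumptions a university with 100 researchers could submit at most 25 products (one every four researchers), and a research agency of the same size at most 50 (one every two).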
In the second phase of the exercise, which was carried out with the aid of a web platform, panelists assigned research products to external referees. Each product was assessed by at least two referees who peer-reviewed it according to four aspects of merit: quality (the opinion of the peer on the scientific excellence of the product compared to the international standard), importance, originality and internationalization. Referees also expressed a final score on the following four-point scale:

1. excellent: a product within the top 20% of the value in a scale shared by the international scientific community;
2. good: a product in the 60%-80% segment;
3. acceptable: a product in the 40%-60% segment;
4. limited: a product within the bottom 40%.

For every evaluated product, panels drew up a consensus report where panelists re-examined the peer judgments and fixed the final score. Furthermore, CIVR weighted the peer review scores as follows: 1 (excellent), 0.8 (good), 0.6 (acceptable), and 0.2 (limited). The numeric formulation made it possible to sum product scores, in order to obtain a mean rating for single research structures, providing a proxy for the value of the institution's research performance and the possibility to compile corresponding rankings of structures. Rankings were compiled for each disciplinary area and within groups of structures of comparable sizes: mega structures (more than 74 products), large structures (25-74 products), medium structures (10-24 products), and small structures (less than 10 products). Panels provided a final report including ranking lists of the institutions in the surveyed area, highlighting strengths and weaknesses of the research area, and proposing possible actions of improvement.
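As an illustration of how the weighted scores turn into a structure rating, the following sketch averages the CIVR weights over the products of one structure and assigns the size group used for the rankings. The weights and size thresholds are those given above; the function names and the string labels are assumptions made here, not CIVR terminology.

```python
from statistics import mean

# CIVR weights for the four-point peer review scale (from the text above).
WEIGHTS = {"excellent": 1.0, "good": 0.8, "acceptable": 0.6, "limited": 0.2}


def structure_rating(judgements):
    """Mean weighted score of the products submitted by one structure."""
    return mean(WEIGHTS[j] for j in judgements)


def size_class(n_products):
    """Size group within which structures of comparable size are ranked."""
    if n_products > 74:
        return "mega"
    if n_products >= 25:
        return "large"
    if n_products >= 10:
        return "medium"
    return "small"
```

For instance, a structure whose four products were judged excellent, good, good and limited would obtain a rating of (1 + 0.8 + 0.8 + 0.2) / 4 = 0.7 and would be ranked among the small structures.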
In the final phase of the assessment exercise, CIVR produced a detailed analysis of requested data and indicators, integrating panel reports with collected data about human resources and project funding. The CIVR final report defines a first-ever comprehensive assessment of the national research system. In summer 2009, VTR outcomes were used for the first time by the Ministry of Education, University, and Research to allocate a 7% share of the Ordinary Fund for Higher Education (FFO).

Footnote 3: A researcher counts as 0.5 full-time-equivalent researchers in universities, where researchers teach as well, and as 1 in research agencies. Hence, universities were allowed to submit a maximum number of products corresponding to 25% of the three-year average permanent academic staff.

3. A bibliometric analysis of VTR

Our analysis considers the following research areas:

1. mathematics and computer sciences (MCS);
2. physics (PHY);
3. chemistry (CHE);
4. earth sciences (EAS);
5. biology (BIO);
6. medical sciences (MED);
7. agricultural sciences and veterinary medicine (AVM);
8. civil engineering and architecture (CEA);
9. industrial and information engineering (IIE);
10. economics and statistics (ECS).

We excluded from our investigation the six interdisciplinary areas as well as the following four areas: philological-literary sciences, antiquities and arts; history, philosophy, psychology and pedagogy; law; political and social sciences. The number of submitted products in these areas that are covered by Thomson Reuters databases is too modest for a reliable application of bibliometrics. In the following, we refer to a product contained in Thomson Reuters databases as a Thomson Reuters (TR) article.

For each submitted product we have at our disposal a peer review judgement.
Moreover, for each TR article we computed the following bibliometric indicators:

1. article citation rating, counting the number of citations that the article received from other TR papers. We retrieved all citations recorded in the Thomson Reuters Web of Science database received by more than 17,000 papers up to June 2006. Since papers refer to the period 2001-2003, this means that we used a citation window with a minimum length of 2.5 years, a maximum length of 5.5 years, and an average length of 4 years. These periods are generally sufficient for a paper to collect the peak of citations in each of the surveyed disciplines;
2. journal citation rating, evaluating the average number of recent citations received by papers published in the journal in which the article appears. We computed the average 2-year journal impact factor over the period 2001-2003.

Furthermore, we computed the Hirsch (h) index over relatively large sets of papers. The h index for a publication set is the highest number n such that there are n papers in the set, each of which received at least n citations (Hirsch, 2005).
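The definition of the h index translates directly into a few lines of code. The sketch below is an illustration of the definition only; the function name and the example citation counts are hypothetical and do not come from the VTR data.

```python
def h_index(citations):
    """Largest n such that n papers received at least n citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for position, cites in enumerate(ranked, start=1):
        if cites >= position:
            h = position  # the top `position` papers all have >= position citations
        else:
            break
    return h
```

With the hypothetical citation counts 10, 8, 5, 4 and 3, four papers have at least four citations each but there are not five papers with at least five, so the h index is 4. A set made of two heavily cited papers and many uncited ones cannot exceed an h index of 2, however large its mean citation count.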
The h index immediately found interest in the public (Ball, 2007) and in the bibliometrics literature (see Bornmann and Daniel (2007b) for opportunities and limitations of the h index). In particular, it is currently computed by both Thomson Reuters Web of Science and Elsevier Scopus bibliometric data sources. The index is meant to capture both production and impact of a publication set in a single figure. It favors publication sets containing a continuous stream of influential works over those including many quickly forgotten ones or a few blockbusters. Moreover, the index is robust to self-citations: all self-citations to papers with less than h citations are irrelevant for the computation of the index, as are the self-citations to papers with many more than h citations.

We aggregated peer review and bibliometric data at both levels of research disciplines (Section 3.1) and research structures within disciplines (Section 3.2).

3.1. Analysis at the level of research discipline

Table 1 contains, for each surveyed discipline, the following columns:

• area: the disciplinary area abbreviated as above;
• size: the number of submitted products (footnote 4). This gives an indication of the size (number of researchers) of the field;
• cov: the fraction of submitted products that are covered in TR databases;
• auth: the mean number of authors per paper. We interpret this as a measure of the discipline's propensity for collaboration among scholars;
• own: degree of ownership. For a given paper submitted by a given structure, it is the number of paper authors that are affiliated to the structure that submitted the paper divided by the total number of paper authors. It indicates the discipline's propensity for collaboration with scholars of different research structures (belonging to the same or different fields): the lower the degree of ownership, the higher the inter-structure collaboration propensity;
• peer: the average peer review rating. Within brackets we show the rating over TR articles only;
• cites: the average number of received citations. Within brackets we show the ratio between number of citations and impact factor;
• IF: the average impact factor of the journals publishing the papers;
• h: the h index for the set of submitted TR papers.

Footnote 4: Papers with authors affiliated to structures belonging to different areas are counted for each affiliation area.

area  size  cov   auth  own  peer           cites          IF     h
MCS    787  92%   2.26  69%  0.830 (0.831)   3.97 (3.54)   1.12   18
PHY   1767  89%  51.85  42%  0.879 (0.885)  24.66 (4.26)   5.79   87
CHE   1089  92%   5.10  68%  0.807 (0.813)  16.14 (3.14)   5.14   50
EAS    651  90%   4.17  64%  0.825 (0.836)   7.33 (2.44)   3.01   26
BIO   1575  96%   6.56  66%  0.826 (0.831)  24.58 (2.90)   8.48   83
MED   2639  96%   8.47  59%  0.776 (0.780)  26.65 (3.20)   8.34  106
AVM    750  89%   4.81  67%  0.712 (0.728)   8.20 (3.08)   2.66   27
CEA    758  45%   2.40  84%  0.750 (0.755)   3.58 (3.10)   1.16   14
IIE   1195  82%   3.48  77%  0.774 (0.779)   4.78 (2.98)   1.61   23
ECS    971  54%   1.86  76%  0.673 (0.799)   3.16 (3.63)   0.87   17

Table 1: Analysis at the level of research discipline.

The largest area is MED, followed by PHY and BIO; small fields are EAS, AVM, CEA and MCS. All areas have a large TR coverage with two notable exceptions: CEA (45%) and ECS (54%); important sub-fields of these areas frequently publish in books, which are not covered by TR. More precisely, CEA groups civil engineering and architecture; the former mostly publishes in journals and has a good TR coverage (75%). On the contrary, scholars in architecture frequently publish books and book chapters and hence the TR coverage is limited (3%).
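The degree of ownership defined in the column list can be sketched as follows. This is an illustration under a simplifying assumption made here, namely that each author is recorded with exactly one affiliation; the function names and the structure labels used in the example are hypothetical.

```python
def ownership(author_affiliations, submitting_structure):
    """Share of a paper's authors affiliated to the submitting structure."""
    if not author_affiliations:
        raise ValueError("a paper needs at least one author")
    owned = sum(1 for a in author_affiliations if a == submitting_structure)
    return owned / len(author_affiliations)


def mean_ownership(papers):
    """Average ownership over (author affiliations, submitting structure) pairs."""
    return sum(ownership(affs, s) for affs, s in papers) / len(papers)
```

A paper with four authors, two of whom belong to the submitting structure, has an ownership of 0.5. Averaging the per-paper values over all papers of an area gives a figure comparable to the own column of Table 1, assuming the column is a per-paper mean.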
It follows that, for the purpose of this study, the output of area CEA is largely dominated by civil engineering products. As for ECS, it is mainly composed of economics, management, and mathematics. Scholars in mathematics and economics publish mostly in journals, but these are differently covered by TR (56% in economics versus 78% in mathematics). Scholars in management prefer books or book chapters, reducing the TR coverage to 22%. Computer scientists typically prefer conference proceedings to archival journals as a means of publication, but journals typically convey a higher impact (Franceschet, 2010a,c). Although TR does not index conference proceedings (at least it did not at the time of the assessment exercise), TR coverage of MCS is reasonably high. This is because computing structures submitted for evaluation mostly journal papers instead of the more frequent proceedings papers, probably because they perceived that these publications are of higher quality.

The mean number of authors varies across disciplines. PHY, MED, and BIO are the fields with the largest number of authors per paper, while ECS, MCS, and CEA are the areas with the lowest authorship propensity. Notice that PHY is a significant outlier: on average, papers in this discipline have more than 50 authors. A closer look at the authorship distribution reveals that it is highly skewed: there are many papers with few authors and a few with a huge number of authors. The median number of authors is 5, meaning that at least 50% of the papers have at most 5 authors, a figure comparable with other disciplines. On the other hand, 13% of the papers have more than 100 authors, and there exists a hub paper with the impressive number of 1412 co-authors.
This phenomenon, known as hyperauthorship and typical of certain areas of research including high energy physics, is investigated in Cronin (2001).

We observed a significant negative correlation between authorship and ownership (footnote 5): the larger the number of authors per paper, the lower the ownership degree of papers, indicating a stronger propensity to collaborate outside the home institution. For instance, more than half of the authors of papers in PHY belong to a structure different from the submitting one. At the other extreme, authors in CEA and, to a lesser extent, those in IIE and ECS, prefer to work in small groups within their research structures.

Peer review judgements were, on average, quite high, reflecting the fact that each structure selected only its best papers in each discipline. Moreover, the average judgement over all products closely matches the mean judgement over TR articles only, with the exception of area ECS: in this field TR articles have been evaluated significantly higher than non-TR products. The fields with the best peer ratings are PHY, MCS, BIO, and EAS, in this order. The areas with the poorest peer judgements are ECS and AVM. In the case of economics and statistics, an explanation of the bad performance is the high frequency of non-TR products which received a low peer rating. Furthermore, Reale et al. (2007) claim that the lower levels of rating for this area are also associated with the higher disagreement of the panel consensus in this sector with respect to the others. As for AVM, the ratings of its sub-fields are: agronomy (0.678), entomology (0.681), veterinary science (0.684), food and nutrition (0.697), animal science (0.720), plant science (0.721), agricultural chemistry (0.757).
Based on available ratings, animal science, plant science and agricultural chemistry tend to be in line with situations of good scientific quality, but the other sub-fields rank below the national standard.

Bibliometric indicator scores vary widely across fields. This field effect is a well known phenomenon in bibliometrics (see, e.g., Althouse et al. (2008)). This is mainly due to the different field publication coverage of the underlying bibliographic databases and to the different field citation habits, including number of references per paper and citation speed. The ratio between article citation and impact factor scores is supposed to mitigate the field effect. It tells us something about the ability of the institutions from different fields to select the papers with the highest potential impact. In this respect, PHY was the best area and EAS was the worst.

Tables 3 and 4 in Appendix analyze the variables size (number of papers), average number of citations per paper, average journal impact, and h index across sets of papers characterized by different levels of (peer review) quality. The size factor gives more insight into the area overall peer judgement (column peer in Table 1). For instance, papers in PHY received the highest peer review judgements (0.879 on average). Indeed, more than half (52%) of them have been judged excellent, while only 1% of them have been considered limited products. On the other hand, peer reviewers were very critical with respect to products in ECS (the average rating is 0.673): only 17% of the products in this area are excellent works, and a higher share, 18%, are considered limited contributions. Notice that, for all areas but PHY, the most popular referee opinion is good.

Footnote 5: Spearman coefficient -0.82, p-value 0.007.
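The ranking of areas by the ratio between citations and impact factor can be checked against the figures of Table 1. The sketch below divides the average citations by the average impact factor of each area; the input numbers are copied from Table 1, while the variable and function names are assumptions made here. Because it works on the rounded averages, the result may differ in the last digit from the bracketed values printed in the table.

```python
# Average citations and average impact factor per area, copied from Table 1.
TABLE_1 = {
    "MCS": (3.97, 1.12), "PHY": (24.66, 5.79), "CHE": (16.14, 5.14),
    "EAS": (7.33, 3.01), "BIO": (24.58, 8.48), "MED": (26.65, 8.34),
    "AVM": (8.20, 2.66), "CEA": (3.58, 1.16), "IIE": (4.78, 1.61),
    "ECS": (3.16, 0.87),
}


def cites_to_if_ratio(table):
    """Ratio between average citations and average impact factor, per area."""
    return {area: cites / impact for area, (cites, impact) in table.items()}


ratios = cites_to_if_ratio(TABLE_1)
best = max(ratios, key=ratios.get)   # area with the highest ratio
worst = min(ratios, key=ratios.get)  # area with the lowest ratio
```

Computed this way, the highest ratio belongs to PHY and the lowest to EAS, in agreement with the statement above.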
Article citations are positively correlated with the categorical peer review judgement: generally, the average number of citations per paper decreases as the peer rating declines. Excellent papers always receive the highest average number of citations, well above the discipline mean, while acceptable and limited contributions received an average citation impact lower than the discipline mean. Nevertheless, some exceptions to positive correlation exist, namely the impact of limited products in EAS (+2 positions in the categorical ranking), MED (+1), CEA (+1), and IIE (+2).

Journal impact factors are also positively correlated with peer assessment: on average, the impact factor of publishing journals drops as the peer evaluation decreases. The association is, however, not as strong as the one noticed for article citations. Indeed, there are more exceptions to positive association, namely the acceptable papers in MCS (+1 position in the categorical ranking) and EAS (+1), and the limited products in PHY (+2), MED (+2), CEA (+1), and IIE (+2).

The h index discriminates very well between different peer review ratings with only two exceptions: excellent and good papers in fields AVM and CEA. In particular, the h index neatly separates the lower judgements acceptable and limited, on which the discrimination power of both article and journal citation measures is weaker. Take, for example, the sets of acceptable and limited papers in MED. Both the average number of paper citations and the average journal impact factor for limited articles are above the same measures for acceptable papers. On the other hand, the h index of acceptable papers (36) largely dominates that of limited articles (21).
Indeed, the sorted citation sequence for acceptable publications features a longer stream of influential papers, while that for limited papers is headed by two blockbusters, which are responsible for the relatively high mean citation values, but then it quickly decreases.

The relationship between peer judgements and bibliometric indicators, in particular article and journal citation indices, has been further investigated. Within each discipline, we expressed the discrete variable article citation as a categorical variable by splitting the distribution into quartiles to obtain a four-point scale for the variable. We did the same for journal impact factor. Then, we prepared, for each discipline, a contingency table displaying the categorical variables peer judgement and article citation (Tables 5 and 6 in Appendix) and a similar table for peer judgement and journal impact factor (Tables 7 and 8 in Appendix). Each table cell contains the relative frequency of the conditional distribution of the bibliometric variable (either article citation or journal impact factor) given the peer judgement variable. For example, Table 5, discipline BIO, shows that excellent papers in the discipline are split into citation quartiles as follows: 11.3% in the 1st quartile, 18.4% in the 2nd quartile, 25.3% in the 3rd quartile, and 45.0% in the 4th quartile.

It turns out that, with very few exceptions, the majority of excellent papers are associated with the highest bibliometric quartile (the 4th one), while the majority of limited products belong to the lowest bibliometric quartile (the 1st one). Good and acceptable papers distribute over the four quartiles, with a preference for the lower segments, in particular for acceptable products.
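The construction of the contingency tables can be sketched in two steps: citation counts are mapped to quartile labels, and the labels are then tabulated conditionally on the peer judgement. The code below is an illustration only. The paper does not say how quartile boundaries and ties were handled, so the use of `statistics.quantiles` with its default method, the rule that a value equal to a cut point falls in the lower quartile, and all names are assumptions made here.

```python
from collections import Counter, defaultdict
from statistics import quantiles


def to_quartiles(values):
    """Map each value to a quartile label 1..4 using the sample cut points."""
    cuts = quantiles(values, n=4)  # three cut points: Q1, median, Q3
    # A value lands in quartile 1 + (number of cut points strictly below it).
    return [1 + sum(v > c for c in cuts) for v in values]


def conditional_distribution(ratings, quartile_labels):
    """Relative frequency of each quartile given the peer rating."""
    counts = defaultdict(Counter)
    for rating, q in zip(ratings, quartile_labels):
        counts[rating][q] += 1
    return {
        rating: {q: row[q] / sum(row.values()) for q in (1, 2, 3, 4)}
        for rating, row in counts.items()
    }
```

Each row of the result sums to 1, like the BIO example above (11.3% + 18.4% + 25.3% + 45.0% = 100%). The same two functions apply unchanged to journal impact factors.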
If bibliometric and peer assessments were independent variables, we would expect that the relative frequency of each cell would be the product of its marginals (the row and column relative frequencies). Hence, we can test the independence of the bibliometric and peer review variables by comparing the observed frequencies with the expected ones in case of independent variables (this is the well-known Pearson chi-square test for independence). The output of the test is that, for all disciplines, peer judgement and bibliometric indicators are not independent variables (with a significance level less than 0.001), with the unique exception of journal impact factor for area MCS.

The strength of the association between peer opinion and article citation variables, measured with Spearman's rank-order coefficient, ranges from 0.187 for IIE to 0.403 for PHY. All values are significantly different from 0 (p-value < 0.001). The association between peer judgement and journal impact factor ranges from 0.197 for IIE to 0.529 for AVM. All values except that for MCS are significantly different from 0 (p-value < 0.001).

To conclude the investigation of association between peer review and bibliometrics, we performed a probabilistic analysis (Tables 9 and 10 in Appendix). Namely, for each pair of adjacent peer judgments X and Y, we computed the probability $P(c(X) > c(Y))$ (respectively, $P(c(X) = c(Y))$) that for two randomly drawn papers P and Q rated X and Y, respectively, the number of citations of P is greater than (respectively, equal to) the number of citations of Q. If peer judgments are positively correlated with article citations, an educated guess would be that, if rating X is above Y, then $P(c(X) > c(Y))$ is larger than $P(c(Y) > c(X))$.
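Two of the checks described in this passage, the chi-square comparison of observed and expected frequencies and the pairwise probabilities for two ratings, can be sketched in plain Python. This is an illustration and not the authors' code: the function names are assumptions, the chi-square function returns only the statistic and the degrees of freedom (a p-value would require the chi-square distribution from a statistics library), and it assumes every row and column of the table contains at least one observation.

```python
from itertools import product


def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Under independence the expected count is the product of marginals.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


def pairwise_probabilities(cites_x, cites_y):
    """P(c(X) > c(Y)), P(c(Y) > c(X)) and P(c(X) = c(Y)) over all paper pairs."""
    pairs = list(product(cites_x, cites_y))
    greater = sum(p > q for p, q in pairs) / len(pairs)
    smaller = sum(p < q for p, q in pairs) / len(pairs)
    equal = sum(p == q for p, q in pairs) / len(pairs)
    return greater, smaller, equal
```

Here `cites_x` and `cites_y` are the citation counts of the papers rated X and Y. With the hypothetical inputs [5, 3] and [3, 1], three of the four pairs are concordant and one is tied, giving probabilities 0.75, 0 and 0.25, which sum to 1.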
It holds that $P(c(X) > c(Y))$ can be expressed as the following ratio:

$$P(c(X) > c(Y)) = \frac{|\{(P, Q) \mid r(P) = X \text{ and } r(Q) = Y \text{ and } c(P) > c(Q)\}|}{|\{(P, Q) \mid r(P) = X \text{ and } r(Q) = Y\}|}$$

where $r(P)$ is the rating of P, $c(P)$ is the number of citations received by P, and $|\cdot|$ is the cardinality of a set. Clearly, we have that

$$P(c(X) > c(Y)) + P(c(Y) > c(X)) + P(c(X) = c(Y)) = 1$$

Similarly we computed the probabilities $P(IF(X) > IF(Y))$ and $P(IF(X) = IF(Y))$ for the journal impact factor.

We observe that for pairs of judgements (E,G) and (G,A), the number of pairs of articles whose citations are concordant with the judgements is always greater than the number of discordant pairs of papers: the higher the peer rating, the higher the probability of finding highly cited papers as well as that of finding papers published in journals of high impact. For the rating pair (A,L) the situation is more controversial: in four cases out of ten, the exploited bibliometric indicators are less accurate at distinguishing acceptable papers from limited ones.

Figure 1: Categorical scatter plot showing citations received by papers of different peer-assigned quality for research area BIO (horizontal axis: peer; vertical axis: cites). The solid line connects the mean number of citations for each group. Papers of higher quality generally receive more citations.

By way of example, Figure 1 illustrates the found association between citations and peer assessment for research area BIO. The probability that an excellent paper receives more citations than a good one is 0.68 (as opposed to
0.30 for the probability o f the opp osite even t), the probability that a go o d pap er collects more citations tha n a n acceptable o ne is 0.64 (as opp osed to 0.33 ), and the proba bilit y that an acce ptable pap er harvests more citations than a limited one is 0.59 (as oppo sed to 0.35). F urthermore, in 88 % of the cases a pap er r ated excellent receives more citatio ns than a pap er judge d limited, w hile in only 10% of the ca ses the opposite happ ens (2% of the times the tw o pap ers receive the same num b er of citatio ns). 3.2. Anal ysis at the level of r ese ar ch structu r es In this section w e in vestigate the structure ra nkings within each disciplines compiled with r esp ect to p e e r review judgements and bibliometric indicators. F or the sake of statistical significa nce, for each discipline, we included in this analysis o nly res earch entities that submitted at lea st 10 pro ducts b elonging to the discipline. F or each structure in each discipline we compute the following ratings: • p e er r eview r ating : this is the av erage p eer review judgment of the pro d- ucts submitted by the structur e; w e also consider the p eer review judgment restricted to TR ar ticles; • article citation r ating : this is the av er age num be r of citations received by TR articles submitted by the structure; • journal citation r ating : this is the av er age impact factor of journals that published the TR a rticles submitted by the structure. 11 area p eer vs. cites p eer vs. 
IF σ p-v alue σ p-v alue MCS 0.46 0.015 0.52 0.005 PHY 0.81 < 0.001 0.29 0.088 CHE 0.60 < 0.001 0.85 < 0.001 EAS 0.79 < 0.001 0.34 0.140 BIO 0.69 < 0.001 0.74 < 0.001 MED 0.56 < 0.001 0.60 < 0.001 A VM 0.52 0.015 0.52 0.015 CEA 0.32 0.124 0.41 0.043 I IE 0.58 < 0.0 01 0.38 0.03 6 ECS 0.42 0.006 0.4 5 0.003 T able 2: Rank-order correlation b et we en structure rating v ar iables: peer review rating of TR articles (p eer) is compared to article citation rating (cites) and to journal citation rating (IF). W e show the Sp earman r ank-order correlation coefficient ( σ ) and the significance of the test (p-v alue). Univ ersities were allow ed to submit a maximum num ber of products co r- resp onding to (only) 25% of the three-year a verage permane nt academic staff. Research institutions were par titioned according to the n umber of submitted pro ducts in mega structures (over 74 submitted pro ducts), large s tructures (from 25 to 74 products), medium structures (from 10 to 2 4 pro ducts), and s ma ll structures (less than 10 pro ducts). Except for mega structure s , the num ber s of submitted pro ducts are, in general, not sufficient for a reliable computation, at the structure lev el, of the h index, whose score , by definition, is bo unded by the nu mber o f pap ers in the ev aluatio n set. F or this reaso n, w e do not consider the h index in the pre s en t analysis at the level of r esearch structures . W e p erfo r med a r ank-or de r cor relation analysis to compare the structure compilations according to p eer review and bibliometric ra ting s. W e tes ted the hypothesis that the Sp earma n correlation co efficient is differ e nt fro m n ull and, when it holds, we inv estig ated the strength of the corr elation. T a ble 2 gives the main outcomes for the analys is . The used p eer rating refers to TR ar ticles o nly . 
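To make the procedure concrete, the following is a minimal sketch of how structure ratings and their rank-order correlation could be computed. It is an illustration under stated assumptions, not the code used for Table 2: the function names, the numeric peer scores, and the example data are all hypothetical, and the significance test (p-value) is not reproduced.

```python
from statistics import mean


def structure_ratings(products, min_products=10):
    """Average a per-product score for each structure.

    `products` is a list of (structure, score) pairs. Structures with fewer
    than `min_products` products are dropped, mirroring the threshold of 10
    products mentioned in the text.
    """
    by_structure = {}
    for structure, score in products:
        by_structure.setdefault(structure, []).append(score)
    return {s: mean(v) for s, v in by_structure.items() if len(v) >= min_products}


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman coefficient, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: (structure, numeric peer score) and (structure, citations).
peer_scores = [("U1", 1.0), ("U1", 0.8), ("U2", 0.8), ("U2", 0.6), ("U3", 0.6), ("U3", 0.2)]
citations = [("U1", 30), ("U1", 12), ("U2", 40), ("U2", 10), ("U3", 5), ("U3", 1)]

peer = structure_ratings(peer_scores, min_products=2)
cites = structure_ratings(citations, min_products=2)
common = sorted(set(peer) & set(cites))
sigma = spearman([peer[s] for s in common], [cites[s] for s in common])
print(sigma)
```

The sketch uses only the standard library so that the rank computation is visible; in practice a statistics package would also supply the p-value. The coefficient is undefined when all structures share the same rating, a case the sketch does not handle.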
The outcomes are summarized in the following:

• There is an overall positive correlation between peer rating and article citation rating at the structure level. In particular, for six areas, namely PHY, CHE, EAS, BIO, MED, and IIE, the correlation is significant at a level less than 0.001, and for MCS, AVM, and ECS the correlation is significant at a level of 0.02. [6] On the other hand, the correlation is not significant for area CEA.
• A highly significant correlation between peer rating and journal citation rating is less frequent: in only three cases, CHE, BIO, and MED, the correlation is significant at a level less than 0.001. The association is significant at the level of 0.05 for areas MCS, AVM, CEA, IIE, and ECS. The correlation is not significant for areas PHY and EAS, which are the areas with the highest association with respect to article citation.

[6] A correlation is considered significant when the p-value is less than or equal to 0.05.

Figure 2: Rank plot comparing peer and article citation ratings for structures in research area BIO (horizontal axis: peer rank; vertical axis: bibliometric rank). For each structure, the rank of the structure according to peer rating is plotted against the structure rank according to article citation rating. Peer rating favors structures above the solid bisector line and hampers those below, while those on the line do not change their ranks in the two compilations.

By way of example, Figure 2 contains a rank plot comparing peer and article citation ratings for structures in research area BIO (35 structures that submitted at least 10 products). In general, the structure rank in the citation compilation increases as the structure rank in the peer compilation rises. The median change of rank is 4 (11% of the compilation length).
Peer review, compared to citation rating, mostly favours structures Milano (+15 positions with respect to the citation compilation), Trento (+9), L'Aquila (+9), and Roma Tre (+8). On the other hand, the structures that are most advantaged by using the bibliometric ranking are Genova (+11 positions with respect to the peer compilation), Pavia (+11), and Trento (+11). Institutions Roma La Sapienza, Palermo, ENEA, and Parma do not change their positions in the two listings.

4. Related work

The most famous and discussed European national research evaluation is the Research Assessment Exercise (RAE) in Great Britain, which started in 1986. It is a peer review evaluation dealing with approximately half of the total portfolio of research outputs of the assessed institutions – for RAE 2008, research structures were invited to submit four research products for each full-time researcher (as opposed to one product every four researchers in the Italian RAE). Interestingly, it was announced that, after the 2008 edition of the exercise, a system of citation-based metrics will be introduced to inform and supplement peer review where robust data are available – most likely in medicine, science, and engineering – with the goals of achieving consistency, international benchmarking, and where possible reducing workloads. The first new exercise, renamed Research Excellence Framework (REF), is due to be completed in 2013.

In the US, evidence suggests that publication and citation metrics are more readily accepted and more liberally applied.
Cronin (1996) cites the following example to illustrate the greater tolerance of evaluative bibliometrics in North America:

“In a recent legal action initiated by a female assistant professor of biology, who had been denied tenure at Vassar College, the plaintiff's lawyer brought forward as evidence of discrimination the fact that her untenured client had a higher citation count than some tenured male staff in the same department. Although the female candidate's case was overturned subsequently on appeal (in part, and ironically, as a result of errors in the citation data submitted as evidence), the legal admissibility and potential courtroom impact of citations are worthy of note.”

The literature offers more than a few contributions dedicated to the comparison of peer review and bibliometric evaluation methodologies. The following is a (necessarily incomplete) selection. The use of citation metrics in place of, or as a supplement to, the UK RAE has been considered extensively. For instance, Oppenheim and Norris (2003) observe a statistically significant correlation between the 2001 RAE result and citation counts for archaeology, and their paper contains references to other studies that have found positive associations for other fields and exercises.

van Raan (2006) investigates the statistical correlation between different bibliometric indicators, including the h index and the ‘crown indicator’ (a citation average normalized to world average, a measure developed and implemented by the author's group at Leiden), and peer review judgement for university chemistry research groups in the Netherlands. Results show that the h index and the crown indicator both relate in a quite comparable way with peer judgements.
In particular, both indicators discriminate very well between highly rated groups and poorly rated ones, but less well between good and excellent judgements.

Bornmann and Daniel (2007a) investigate the convergent validity of decisions for awarding long-term fellowships to post-doctoral researchers as practiced by the Boehringer Ingelheim Fonds – an international foundation for the promotion of basic research in biomedicine – by using the h index. Grant and fellowship peer review is principally an evaluation of the potential of the proposed research. The h indices of approved applicants are on average consistently higher than those of rejected applicants. Nevertheless, the distributions of the h indices partly overlap: some rejected applicants have an h index that is substantially higher than that of approved applicants, and some approved applicants have an h index that is substantially lower than that of rejected applicants.

Rinia et al. (1998) study the correlation between bibliometric indicators and the outcomes of peer judgements of research programmes made by expert committees of condensed matter physics in the Netherlands. In particular, a breakdown of correlations to the level of different peer review criteria has been made. The authors draw a number of interesting conclusions. Positive and significant but not perfect correlations are found between a number of bibliometric indicators (in particular the average number of citations per publication and the above-mentioned crown indicator) and peer judgements of research programmes. The impact of publication journals, as reflected by the mean journal citation rates, does not correlate well with the quality of these programmes as perceived by peers. A negative correlation is found between the percentage of self-citations and jury ratings.
Correlations between bibliometric indicators and expert judgements are higher in the case of ‘curiosity driven’ basic research than in the case of ‘application driven’ research. Finally, at the level of specific criteria used by juries, the highest correlation is found between ratings for bibliometric indicators and the criterion ‘team’ – the assessment of the competency of researchers and of the research team.

Asknes and Taxt (2004) investigate the relationship between bibliometric indicators and the outcomes of peer reviews based on a case study of research groups within the natural sciences at the University of Bergen, Norway. The analysis shows positive but relatively weak correlations. Groups obtaining the highest citedness indices were all rated as very good or excellent. On the other hand, groups cited below the world average obtained rather heterogeneous ratings. The authors conclude that peer review and bibliometric methods should be used in combination. In particular, in cases where there is a significant deviation between the two evaluation outcomes, the panel should investigate the reasons for these discrepancies.

The preceding comparisons are limited to only a few disciplinary sectors or to just one sector, or even to a single institution. By contrast, our investigation spans 10 disciplinary areas in the sciences and social sciences and involves the output of more than a hundred public and private research structures. Two contributions relate most closely to ours. Reale et al. (2007) analyse the output of the Italian VTR for four areas: chemistry, biology, economics and humanities. The authors find a general consensus among the experts' advice (though weaker in economics) and show that peer review was not biased toward prestige of institutions or reputation of scientists.
On the other hand, they notice a bias linked to interdisciplinary (non-conventional) research. Furthermore, the authors perform a Spearman correlation analysis as well as an ordinal regression to compare peer judgements of papers with the impact factor of the journals publishing the papers for the chemistry, biology, and economics areas. They find a statistically significant association, although not strong, and conclude that “this reinforces the idea that impact factor is a good predictor of the quality of journals – not for the quality of articles published in a particular journal”. Finally, they suggest that “further developments of VTR should go toward a larger use of the bibliometric indicators, in conjunction with peer review”.

Abramo et al. (2009) provide a broader investigation of the Italian VTR outcomes for eight disciplines: the ten disciplines we have used in our study with the exclusion of civil engineering and architecture (CEA) and economics and statistics (ECS), for which the database coverage is lower. The authors correlate, at the research structure level, peer quality opinions on papers with metrics based on the impact factor of the journals publishing the papers, normalized across scientific disciplinary sectors within disciplinary areas. They conclude that the two evaluation methods (peer review and bibliometrics) significantly overlap for the surveyed fields, and that “bibliometrics currently offer levels of potential and methodological maturity that should induce a reconsideration and revision of their role.” Furthermore, the study shows that, with the benefit of hindsight, Italian universities, in the main, did not identify and submit for evaluation their best publications in terms of citational impact.
Finally, the authors give evidence that research structures indicated as being of top quality by VTR are not necessarily also the most productive ones.

The main difference between the two previous studies mentioned above and ours is the set of bibliometric indicators we have contrasted with peer judgements. Besides the journal impact factor, we used the number of citations collected by individual papers, which directly relates to the potential impact of papers, and not to that of publishing sources, as well as the h index for relatively large publication sets. It is worth remembering that the journal impact factor was conceived as a measure of journal status, and not of the impact of single papers published within it (see Garfield (2006) and Pendlebury (2009) for recent additions to this incessant debate). In particular, citation distributions considered in the computation of journal impact factors are always severely skewed, meaning that the majority of the papers in the journal are cited much less than the mean represented by the impact factor (Seglen, 1992; Campbell, 2008). Furthermore, we provided investigation both at the level of research disciplines (the ratings of papers) and at the level of research structures (the ratings of institutions submitting the papers). We also included in the analysis civil engineering as well as economics and statistics, given the non-negligible fractions of submitted products that are covered by Thomson Reuters data sources. Finally, we performed different types of correlation analysis, including an intuitive probabilistic investigation.

5. Conclusion

We recall the research questions posed in the introduction and we propose answers based on the current investigation of the Italian research system:

1. Are peer review judgements and (article and journal) bibliometric indicators independent variables?
Neither article citation nor journal impact is independent of peer review assessment, and the correlation is positive in both cases: the higher the peer review opinion on a paper, the higher the number of citations that the paper and the publishing journal receive. Furthermore, the recently proposed h index appears to discriminate very well between sets of papers assessed with different peer judgements. It might be a viable indicator of the impact of research structures in the next editions of the evaluation exercise as soon as the average number of submitted products per structure significantly increases.

2. What is the strength of the association?

The correlation between peer assessment and bibliometric indicators is statistically significant, although not perfect. Moreover, the strength of the association varies across disciplines, and it depends also on the discipline-internal coverage of the bibliometric database used (the higher the discipline coverage, the higher the reliability of citation measures). Notwithstanding, the skeptic has at their disposal a few examples of papers that receive a positive peer judgement but do not collect a significant number of citations or that even sleep uncited (van Raan, 2004). Furthermore, there are papers that obtain a poor judgement from peers but that rally when citations are taken into account. Even more exceptions are available when comparing peer conclusions and impact factors of journals. Nevertheless, using the words of Moed (2005), a methodology, even if it provides invalid outcomes in individual cases, may be beneficial to the scholarly system as a whole.

3. Is the association between peer judgement and article citation rating significantly stronger than the association between peer judgement and journal citation rating?
A somewhat surprising finding of the present investigation is that the difference between the correlation strengths of article citation and journal impact factor with respect to peer assessment, although perceivable, is not as strong as one might expect. [7] It is worth noticing that, during the evaluation process, peer reviewers had access to the impact factors of journals that published the assessed papers, but they did not have enough information about the number of citations collected by the evaluated papers, since most of these citations were not yet mature at the time of reviewing. Therefore, peer quality opinions cannot be biased toward highly cited papers and the association between peer review and article citation is authentic.

[7] As noticed above, the journal impact factor is a measure of journal status and not of the impact of individual papers published in the journal.

It is worth observing that, as already pointed out by Asknes and Taxt (2004), peer judgements and bibliometric performance measures can be expected to be positively correlated only if the aspects assessed by the peers correspond to those reflected through bibliometric indicators. The notion of quality assessed during peer review is perceived as a broad concept with different aspects; some of these aspects, but not necessarily all, are captured by bibliometrics. Moreover, different bibliometric measures reflect different aspects of quality, for instance, productivity, popularity, and prestige (Franceschet, 2010b).

In summary, we found a compelling body of evidence that judgements given by domain experts and bibliometric indicators are significantly positively correlated.
Therefore, bibliometric indicators may be considered as approximation measures of the inherent quality of papers, which, however, remains fully assessable only with the aid of unbiased human judgement, meditation, and elaboration. We advocate the integration of peer review with bibliometric indicators, in particular those directly related to the impact of individual articles, during the next national assessment exercises. The cost effectiveness of bibliometric evaluation compared to that of peer review would allow the evaluation of a larger sample of the universe under investigation without a significant increase in costs, which is a major requirement due to the chronic national deficit and the pressing necessity of controlling public expenses in Italy. [8] This would allow a shift from the assessment of research excellence to a more balanced evaluation of average research performance. Larger samples would, in turn, enhance the reliability of bibliometric indicators.

[8] To be sure, the cost of bibliometric evaluation is lower than that of peer review; nonetheless, every experienced bibliometrician knows that the cost to produce a reliable large-scale bibliometric assessment is far from null.

Acknowledgements

The authors would like to thank CIVR and its President, Prof. Franco Cuccurullo, for making available the data used in this paper, through the agreement protocol between CIVR and the PhD course in “Strumenti e metodi per la valutazione della Ricerca” of the University of Chieti-Pescara.

References

Abramo, G., D'Angelo, C. A., Caprasecca, A., 2009. Allocative efficiency in public research funding: can bibliometrics help? Research policy 38 (1), 206–215.
Althouse, B. M., West, J. D., Bergstrom, C. T., Bergstrom, T., 2008. Differences in impact factor across fields and over time. Journal of the American Society for Information Science and Technology 60 (1), 27–34.
Asknes, D. W., Taxt, R. E., 2004. Peer reviews and bibliometric indicators: a comparative study at a Norwegian university. Research evaluation 13 (1), 33–41.
Ball, P., 2007. Achievement index climbs the ranks. Nature 448, 737.
Bleiklie, I., 1998. Justifying the evaluative state: new public management ideals in higher education. European journal of education 33 (3), 299–318.
Bornmann, L., Daniel, H.-D., 2007a. Convergent validation of peer review decisions using the h index: Extent of and reasons for type I and type II errors. Journal of Informetrics 1 (3), 204–213.
Bornmann, L., Daniel, H.-D., 2007b. What do we know about the h index? Journal of the American Society for Information Science and Technology 58 (9), 1381–1385.
Calzà, L., Garbisa, S., 1995. Italian professorships. Nature 374, 492.
Campbell, P., 2008. Escape from the impact factor. Ethics in science and environmental politics 8, 5–7.
Cronin, B., 1996. Rates of return to citation. Journal of Documentation 52 (2), 188–197.
Cronin, B., 2001. Hyperauthorship: A postmodern perversion or evidence of a structural shift in scholarly communication practices. Journal of the American Society for Information Science and Technology 52 (7), 558–569.
Franceschet, M., 2010a. A comparison of bibliometric indicators for computer science scholars and journals on Web of Science and Google Scholar. Scientometrics. In press.
Franceschet, M., 2010b. The difference between popularity and prestige in the sciences and in the social sciences: a bibliometric analysis. Journal of Informetrics 4 (1), 55–63.
Franceschet, M., 2010c. The role of conference publications in computer science: a bibliometric view. Communications of the ACM. In press.
Garfield, E., 2006. The history and meaning of the journal impact factor. Journal of the American Medical Association 295 (1), 90–93.
Garfield, E., Sher, H., 1963. New factors in the evaluation of scientific literature through citation indexing. American Documentation 14, 195–201.
Hirsch, J. E., 2005. An index to quantify an individual's scientific research output. Proceedings of the National Academy of Sciences of USA 102 (46), 16569–16572.
Minelli, E., Rebora, G., Turri, M., 2008. The structure and significance of the Italian research assessment exercise (VTR). In: Mazza, C., Quattrone, P., Riccaboni, A. (Eds.), European Universities in Transition. Edward Elgar Publishing.
Moed, H. F., 2005. Citation Analysis in Research Evaluation. Springer.
Neave, G., 1998. The evaluative state reconsidered. European journal of education 33 (3), 265–285.
Oppenheim, C., Norris, M., 2003. Citation counts and the research assessment exercise V: archaeology and the 2001 RAE. Journal of Documentation 56 (6), 709–730.
Pendlebury, D. A., 2009. The use and misuse of journal metrics and other citation indicators. Archivum Immunologiae et Therapiae Experimentalis 57 (1), 1–11.
Reale, E., Barbara, A., Costantini, A., 2007. Peer review for the evaluation of academic research: lessons from the Italian experience. Research evaluation 16 (3), 216–228.
Rinia, E. J., van Leuween, T. N., van Vuren, H. G., van Raan, A. F. J., 1998. Comparative analysis of a set of bibliometric indicators and central peer review criteria. Evaluation of condensed matter physics in the Netherlands. Research policy 27 (1), 95–107.
Seglen, P. O., 1992. The skewness of science. Journal of the American Society for Information Science 43 (9), 628–638.
van Raan, A. F. J., 2004. Sleeping beauties in science. Scientometrics 59 (3), 467–472.
van Raan, A. F. J., 2006. Comparison of the Hirsch-index with standard bibliometric indicators and with peer judgment for 147 chemistry research groups. Scientometrics 67 (3), 491–502.
area  rating  size         cites           IF              h
MCS   E       284 (36%)    5.52 (1.39)     1.15 (1.02)     16 (0.89)
      G       381 (48%)    3.31 (0.83)     1.10 (0.98)     13 (0.72)
      A       101 (13%)    2.61 (0.66)     1.18 (1.05)     7 (0.39)
      L       21 (3%)      2.18 (0.55)     0.91 (0.81)     3 (0.17)
PHY   E       914 (52%)    35.30 (1.43)    6.97 (1.20)     85 (0.98)
      G       676 (38%)    14.19 (0.58)    4.71 (0.81)     40 (0.46)
      A       158 (9%)     5.98 (0.24)     3.10 (0.54)     14 (0.16)
      L       19 (1%)      5.69 (0.23)     5.59 (0.97)     7 (0.08)
CHE   E       342 (32%)    24.72 (1.53)    6.84 (1.32)     42 (0.84)
      G       513 (47%)    13.54 (0.84)    4.67 (0.91)     34 (0.68)
      A       200 (18%)    8.84 (0.55)     3.57 (0.69)     18 (0.36)
      L       34 (3%)      7.57 (0.47)     3.47 (0.67)     9 (0.18)
EAS   E       220 (34%)    10.37 (1.42)    4.13 (1.37)     22 (0.85)
      G       324 (50%)    6.10 (0.83)     2.39 (0.79)     18 (0.69)
      A       91 (14%)     4.12 (0.56)     2.60 (0.87)     9 (0.35)
      L       16 (2%)      6.37 (0.87)     2.16 (0.72)     4 (0.15)
BIO   E       519 (33%)    40.86 (1.66)    12.01 (1.42)    75 (0.90)
      G       802 (51%)    17.97 (0.73)    7.09 (0.84)     49 (0.59)
      A       222 (14%)    11.59 (0.47)    5.47 (0.64)     21 (0.25)
      L       32 (2%)      5.65 (0.23)     5.02 (0.59)     6 (0.07)

Table 3: Peer judgement and bibliometric indicators (part I). rating: peer review rating (E = Excellent, G = Good, A = Acceptable, L = Limited); size: number of products with the given peer rating (with percentage with respect to all products); cites: average number of citations of articles with the given peer rating (with ratio with respect to the average over all articles); IF: average impact factor of journals of articles with the given peer rating (with ratio with respect to the average over all articles); h: h index of articles with the given peer rating (with ratio with respect to the index over all articles).
area  rating  size          cites           IF              h
MED   E       667 (25%)     47.72 (1.79)    11.73 (1.41)    89 (0.84)
      G       1314 (50%)    21.98 (0.82)    7.49 (0.90)     65 (0.61)
      A       492 (19%)     13.83 (0.52)    6.17 (0.74)     36 (0.34)
      L       166 (6%)      14.72 (0.55)    7.67 (0.92)     21 (0.20)
AVM   E       76 (10%)      16.54 (2.02)    6.41 (2.41)     18 (0.67)
      G       393 (52%)     8.67 (1.06)     2.54 (0.96)     24 (0.89)
      A       218 (29%)     5.15 (0.62)     1.77 (0.66)     15 (0.56)
      L       63 (9%)       3.21 (0.39)     1.28 (0.48)     6 (0.22)
CEA   E       166 (22%)     5.43 (1.52)     1.88 (1.62)     11 (0.79)
      G       329 (43%)     3.58 (1.00)     1.04 (0.90)     10 (0.71)
      A       217 (29%)     2.29 (0.64)     0.80 (0.70)     7 (0.50)
      L       46 (6%)       2.50 (0.70)     0.81 (0.70)     4 (0.29)
IIE   E       248 (21%)     7.16 (1.50)     2.03 (1.27)     19 (0.83)
      G       612 (51%)     4.57 (0.96)     1.56 (0.97)     18 (0.78)
      A       300 (25%)     3.18 (0.67)     1.33 (0.83)     11 (0.49)
      L       35 (3%)       4.74 (0.99)     1.65 (1.03)     5 (0.22)
ECS   E       168 (17%)     5.55 (1.76)     1.31 (1.51)     14 (0.82)
      G       365 (38%)     2.77 (0.88)     0.75 (0.87)     11 (0.65)
      A       265 (27%)     1.06 (0.33)     0.58 (0.67)     4 (0.24)
      L       173 (18%)     0.67 (0.21)     0.48 (0.56)     2 (0.12)

Table 4: Peer judgement and bibliometric indicators (part II).
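The h column of Tables 3 and 4 reports the h index of the set of articles sharing a peer rating. As a reminder of the definition (Hirsch, 2005), a set of papers has index h when h of its papers have at least h citations each. The sketch below is a minimal illustration of that definition; the function name and the citation counts are hypothetical and are not taken from the assessment data.

```python
def h_index(citations):
    """Largest h such that at least h papers have h or more citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for position, cites in enumerate(ranked, start=1):
        if cites >= position:
            h = position  # the top `position` papers all have >= position citations
        else:
            break
    return h


# Hypothetical citation counts for a small set of papers.
example = [10, 8, 5, 4, 3]
print(h_index(example))
```

Because the loop can advance at most once per paper, the index can never exceed the number of papers in the set, which is why the text refrains from computing it for structures that submitted few products.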
area  rating  1st quartile  2nd quartile  3rd quartile  4th quartile
MCS   E       27.3%         13.3%         27.3%         32.0%
      G       40.0%         15.7%         24.9%         19.4%
      A       48.9%         18.1%         21.3%         11.7%
      L       47.1%         35.3%         5.9%          11.8%
PHY   E       16.1%         20.6%         25.7%         37.6%
      G       36.7%         29.2%         22.2%         11.9%
      A       63.4%         25.2%         9.2%          2.3%
      L       46.2%         53.8%         0.0%          0.0%
CHE   E       14.3%         16.2%         25.9%         43.6%
      G       35.2%         22.8%         22.8%         19.8%
      A       42.4%         27.3%         23.8%         6.4%
      L       57.1%         17.9%         21.4%         3.6%
EAS   E       17.0%         24.0%         21.5%         37.5%
      G       33.6%         27.0%         20.4%         19.1%
      A       45.1%         29.6%         12.7%         12.7%
      L       37.5%         12.5%         25.0%         25.0%
BIO   E       11.3%         18.4%         25.3%         45.0%
      G       27.7%         28.7%         27.3%         16.3%
      A       50.2%         30.3%         10.9%         8.5%
      L       73.9%         13.0%         13.0%         0.0%

Table 5: Contingency table displaying the conditional distribution of article citation given peer rating (part I). Peer judgments are abbreviated as follows: E (Excellent), G (Good), A (Acceptable), L (Limited).

area  rating  1st quartile  2nd quartile  3rd quartile  4th quartile
MED   E       12.3%         17.6%         25.3%         44.8%
      G       25.4%         25.2%         28.3%         21.1%
      A       43.4%         25.5%         19.4%         11.7%
      L       60.3%         17.8%         14.4%         7.5%
AVM   E       8.3%          15.3%         26.4%         50.0%
      G       24.7%         25.2%         23.8%         26.3%
      A       38.4%         27.0%         22.2%         12.4%
      L       52.4%         26.2%         16.7%         4.8%
CEA   E       18.5%         13.6%         30.9%         37.0%
      G       39.9%         14.0%         23.1%         23.1%
      A       54.1%         14.3%         21.4%         10.2%
      L       55.0%         15.0%         10.0%         20.0%
IIE   E       25.5%         16.8%         25.0%         32.7%
      G       31.9%         24.1%         21.0%         23.0%
      A       44.6%         23.8%         17.1%         14.6%
      L       52.2%         17.4%         8.7%          21.7%
ECS   E       14.7%         16.7%         27.3%         41.3%
      G       33.7%         21.5%         23.6%         21.1%
      A       54.6%         15.7%         21.3%         8.3%
      L       73.3%         6.7%          13.3%         6.7%

Table 6: Contingency table displaying the conditional distribution of article citation given peer rating (part II).
area  rating  1st quartile  2nd quartile  3rd quartile  4th quartile
MCS   E       20.7%         26.2%         27.0%         26.2%
      G       25.5%         26.6%         24.4%         23.5%
      A       34.7%         18.9%         22.1%         24.2%
      L       29.4%         41.2%         11.8%         17.6%
PHY   E       15.9%         21.3%         28.4%         34.3%
      G       31.7%         31.2%         25.5%         11.5%
      A       54.2%         31.3%         9.2%          5.3%
      L       38.5%         38.5%         7.7%          15.4%
CHE   E       8.4%          16.1%         34.7%         40.9%
      G       26.5%         32.9%         24.1%         16.5%
      A       48.3%         31.4%         18.6%         1.7%
      L       53.6%         25.0%         17.9%         3.6%
EAS   E       16.5%         25.5%         22.5%         35.5%
      G       28.6%         21.7%         29.6%         20.1%
      A       45.2%         24.7%         17.8%         12.3%
      L       25.0%         37.5%         25.0%         12.5%
BIO   E       7.5%          16.8%         26.1%         49.5%
      G       27.4%         31.0%         27.5%         14.1%
      A       56.2%         25.9%         11.9%         6.0%
      L       60.9%         21.7%         8.7%          8.7%

Table 7: Contingency table displaying the conditional distribution of journal impact factor given peer rating (part I). Peer judgments are abbreviated as follows: E (Excellent), G (Good), A (Acceptable), L (Limited).

area  rating  1st quartile  2nd quartile  3rd quartile  4th quartile
MED   E       6.8%          17.3%         26.7%         49.2%
      G       24.0%         29.0%         29.0%         18.0%
      A       45.6%         29.4%         14.9%         10.2%
      L       51.0%         12.9%         12.2%         23.8%
AVM   E       1.4%          2.8%          19.4%         76.4%
      G       15.8%         25.9%         31.6%         26.7%
      A       42.0%         33.5%         17.0%         7.4%
      L       70.5%         20.4%         9.1%          0.0%
CEA   E       11.1%         7.4%          27.2%         54.3%
      G       25.0%         24.3%         32.6%         18.1%
      A       34.7%         38.8%         18.4%         8.2%
      L       38.1%         28.6%         23.8%         9.5%
IIE   E       13.5%         24.0%         25.5%         37.0%
      G       25.5%         23.6%         26.3%         24.6%
      A       33.6%         29.5%         21.6%         15.4%
      L       39.1%         8.7%          26.1%         26.1%
ECS   E       5.3%          14.0%         34.7%         46.0%
      G       27.2%         27.2%         27.2%         18.4%
      A       45.5%         32.7%         8.2%          13.6%
      L       53.3%         33.3%         6.7%          6.7%

Table 8: Contingency table displaying the conditional distribution of journal impact factor given peer rating (part II).
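The contingency tables above give, for each peer rating, the share of articles falling in each quartile of a bibliometric indicator. The following minimal sketch shows one way such rows could be derived. It rests on assumptions that are ours: quartiles are taken over all articles of an area, an article's quartile is determined by the fraction of articles with a strictly lower score (so tied articles share a quartile), and the function names and data are hypothetical. The binning rule actually used for Tables 5 to 8 may differ.

```python
from bisect import bisect_left
from collections import Counter, defaultdict


def quartile_of(sorted_scores, score):
    """Quartile index 0..3 from the fraction of articles scoring strictly lower."""
    below = bisect_left(sorted_scores, score)  # count of scores < score
    return 4 * below // len(sorted_scores)


def conditional_distribution(articles):
    """Map each rating to its distribution over the four quartiles.

    `articles` is a list of (rating, score) pairs, where score is a citation
    count or a journal impact factor. Each returned row sums to 1.
    """
    sorted_scores = sorted(score for _, score in articles)
    counts = defaultdict(Counter)
    for rating, score in articles:
        counts[rating][quartile_of(sorted_scores, score)] += 1
    table = {}
    for rating, row in counts.items():
        total = sum(row.values())
        table[rating] = [row[q] / total for q in range(4)]
    return table


# Hypothetical articles: (peer rating, number of citations).
articles = [("E", 40), ("E", 30), ("E", 5), ("G", 20),
            ("G", 10), ("A", 4), ("L", 1), ("L", 0)]
table = conditional_distribution(articles)
for rating in "EGAL":
    print(rating, [f"{100 * share:.1f}%" for share in table[rating]])
```

Reading a row from left to right then shows how the mass shifts toward the upper quartiles as the peer rating improves, which is the pattern the tables are meant to expose.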
area  ratings  cites >  cites <  cites =  IF >   IF <   IF =
MCS   E ∼ G    0.54     0.35     0.11     0.55   0.45   0.00
      G ∼ A    0.49     0.36     0.15     0.53   0.47   0.00
      A ∼ L    0.40     0.41     0.19     0.57   0.43   0.00
PHY   E ∼ G    0.69     0.29     0.02     0.66   0.32   0.02
      G ∼ A    0.66     0.29     0.05     0.66   0.33   0.01
      A ∼ L    0.40     0.53     0.07     0.36   0.64   0.00
CHE   E ∼ G    0.67     0.31     0.02     0.71   0.27   0.02
      G ∼ A    0.57     0.39     0.04     0.68   0.31   0.01
      A ∼ L    0.53     0.41     0.06     0.56   0.43   0.01
EAS   E ∼ G    0.61     0.33     0.06     0.60   0.38   0.02
      G ∼ A    0.56     0.35     0.09     0.61   0.38   0.01
      A ∼ L    0.34     0.57     0.09     0.41   0.57   0.02
BIO   E ∼ G    0.68     0.30     0.02     0.74   0.25   0.01
      G ∼ A    0.64     0.33     0.03     0.68   0.32   0.00
      A ∼ L    0.59     0.35     0.06     0.57   0.43   0.00

Table 9: Probability analysis of peer judgment and bibliometric indicators (part I). For each pair of adjacent peer ratings, we compute probabilities $P(c(X) > c(Y))$, $P(c(X) < c(Y))$, $P(c(X) = c(Y))$ and $P(IF(X) > IF(Y))$, $P(IF(X) < IF(Y))$, $P(IF(X) = IF(Y))$. Peer judgments are abbreviated as follows: E (Excellent), G (Good), A (Acceptable), L (Limited).
area  ratings  cites >  cites <  cites =  IF >   IF <   IF =
MED   E ∼ G    0.65     0.33     0.02     0.72   0.28   0.00
      G ∼ A    0.61     0.36     0.03     0.65   0.34   0.01
      A ∼ L    0.60     0.35     0.05     0.51   0.49   0.00
AVM   E ∼ G    0.67     0.29     0.04     0.84   0.16   0.00
      G ∼ A    0.58     0.35     0.07     0.72   0.28   0.00
      A ∼ L    0.55     0.35     0.10     0.69   0.31   0.00
CEA   E ∼ G    0.60     0.31     0.09     0.72   0.27   0.01
      G ∼ A    0.54     0.31     0.15     0.62   0.37   0.01
      A ∼ L    0.39     0.41     0.20     0.48   0.52   0.00
IIE   E ∼ G    0.54     0.38     0.08     0.59   0.41   0.00
      G ∼ A    0.53     0.36     0.11     0.69   0.41   0.00
      A ∼ L    0.45     0.40     0.15     0.44   0.56   0.00
ECS   E ∼ G    0.59     0.28     0.13     0.75   0.25   0.00
      G ∼ A    0.50     0.25     0.25     0.63   0.37   0.00
      A ∼ L    0.37     0.20     0.43     0.55   0.44   0.01

Table 10: Probability analysis of peer judgment and bibliometric indicators (part II).
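The probabilities in Tables 9 and 10 are ratios of counts over all pairs of articles, one drawn from each of two rating groups. The sketch below is a minimal illustration of that counting, not the code used to produce the tables; the function name and the citation counts are hypothetical.

```python
from itertools import product


def pairwise_probabilities(scores_x, scores_y):
    """Return (P(x > y), P(x < y), P(x = y)) over all pairs (x, y).

    `scores_x` and `scores_y` hold the indicator values (citations or impact
    factors) of the articles rated X and Y respectively; both must be non-empty.
    """
    greater = less = equal = 0
    for x, y in product(scores_x, scores_y):
        if x > y:
            greater += 1
        elif x < y:
            less += 1
        else:
            equal += 1
    total = len(scores_x) * len(scores_y)
    return greater / total, less / total, equal / total


# Hypothetical citation counts for articles rated Excellent and Good.
excellent = [12, 7, 3]
good = [7, 2]
p_greater, p_less, p_equal = pairwise_probabilities(excellent, good)
print(p_greater, p_less, p_equal)
```

Since every pair falls in exactly one of the three cases, the three values always sum to 1, which matches the identity stated for $P(c(X) > c(Y))$, $P(c(Y) > c(X))$ and $P(c(X) = c(Y))$. The enumeration is quadratic in the group sizes; for large groups, sorting one group and counting with binary search would give the same ratios more cheaply.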
