Do they agree? Bibliometric evaluation vs informed peer review in the Italian research assessment exercise
During the Italian research assessment exercise, the national agency ANVUR performed an experiment to assess agreement between grades attributed to journal articles by informed peer review (IR) and by bibliometrics. A sample of articles was evaluated using both methods, and agreement was analyzed with weighted Cohen’s kappas. ANVUR presented the results as indicating an overall ‘good’ or ‘more than adequate’ agreement. This paper re-examines the experiment’s results against the available statistical guidelines for interpreting kappa values, showing that the degree of agreement, always in the range 0.09–0.42, has to be interpreted, for all research fields, as unacceptable, poor or, in a few cases, at most fair. The only notable exception, also confirmed by a statistical meta-analysis, was a moderate agreement for economics and statistics (Area 13) and its sub-fields. We show that the experimental protocol adopted in Area 13 was substantially modified with respect to all the other research fields, to the point that the results for economics and statistics have to be considered fatally flawed. The evidence of poor agreement supports the conclusion that IR and bibliometrics do not produce similar results, and that the adoption of both methods in the Italian research assessment possibly introduced systematic and unknown biases in its final results. The conclusion reached by ANVUR must be reversed: the available evidence in no way justifies the joint use of IR and bibliometrics within the same research assessment exercise.
💡 Research Summary
The paper critically re‑examines the experiment carried out by Italy’s national research assessment agency (ANVUR) during the VQR (Valutazione della Qualità della Ricerca) to compare the outcomes of informed peer review (IR) and bibliometric evaluation for the same set of journal articles. ANVUR’s original report claimed an overall “good” or “more than adequate” level of agreement between the two methods, based on weighted Cohen’s kappa (κ) statistics calculated for each disciplinary area. The authors of the present study argue that this interpretation is inconsistent with widely accepted guidelines for κ values. According to the standards commonly used in the statistical literature, κ values between 0.01 and 0.20 indicate “poor” agreement, 0.21‑0.40 “fair,” 0.41‑0.60 “moderate,” 0.61‑0.80 “substantial,” and 0.81‑1.00 “almost perfect.”
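To make the scale concrete, the following sketch (not ANVUR's code) computes a weighted Cohen’s κ for two sets of ordinal grades and maps the result onto the bands quoted above. The four-point grade scale and the sample grades are illustrative assumptions, not VQR data.

```python
# Minimal sketch: weighted Cohen's kappa for two graders of the same articles,
# interpreted against the bands quoted in the summary above.
# The grade scale (1 = lowest ... 4 = highest) and the data are hypothetical.
from sklearn.metrics import cohen_kappa_score

# Hypothetical grades assigned to the same ten articles by the two methods.
peer_review_grades = [4, 3, 3, 2, 4, 1, 2, 3, 4, 2]
bibliometric_grades = [3, 3, 2, 2, 4, 2, 1, 4, 3, 2]

# Linear weights penalise disagreements in proportion to their distance
# on the ordinal grade scale (ANVUR's experiment used weighted kappas).
kappa = cohen_kappa_score(peer_review_grades, bibliometric_grades, weights="linear")

def interpret(k: float) -> str:
    """Map a kappa value onto the interpretation bands used above."""
    if k <= 0.20:
        return "poor"
    if k <= 0.40:
        return "fair"
    if k <= 0.60:
        return "moderate"
    if k <= 0.80:
        return "substantial"
    return "almost perfect"

print(f"weighted kappa = {kappa:.2f} -> {interpret(kappa)} agreement")
```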
When the raw κ values reported by ANVUR are examined, they range only from 0.09 to 0.42 across all fields. In all research areas other than economics and statistics, from the natural sciences and engineering to the social sciences and humanities, the κ values cluster between 0.09 and 0.30, which, on the accepted scale, correspond to “poor” or at best “fair” agreement. The only discipline that approaches the “moderate” threshold is economics and statistics (Area 13), where κ reaches about 0.41. However, the authors demonstrate that the evaluation protocol used for Area 13 differed substantially from the protocol applied in the other areas. Specifically, reviewers in Area 13 were given prior access to the bibliometric scores and were instructed to incorporate that information into their peer‑review judgments. This procedural bias likely inflated the κ value for that area, making the apparently higher agreement an artifact of the design rather than a genuine convergence of the two assessment methods.
To strengthen their argument, the authors conduct a meta‑analysis of the κ values across all areas, calculating heterogeneity statistics (Q and I²). The results reveal high heterogeneity, confirming that agreement levels differ significantly among fields and that the overall average κ does not represent a uniform “good” agreement. The meta‑analytic findings also show that the pooled κ remains well below the “moderate” threshold, reinforcing the conclusion that IR and bibliometrics do not produce comparable results in the VQR context.
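As a rough illustration of how such a meta-analysis can be assembled, the sketch below pools area-level κ estimates by inverse-variance weighting and computes Cochran’s Q and I². The κ values, standard errors, and the simple fixed-effect pooling are illustrative assumptions; they do not reproduce the authors’ data or their exact method.

```python
# Minimal sketch, not the authors' code: inverse-variance pooling of
# area-level kappas with Cochran's Q and the I^2 heterogeneity statistic.
# All numbers below are hypothetical placeholders, not ANVUR's values.
import numpy as np

kappas = np.array([0.12, 0.18, 0.25, 0.30, 0.09, 0.41])  # hypothetical area-level kappas
se = np.array([0.04, 0.05, 0.03, 0.06, 0.04, 0.05])      # hypothetical standard errors

w = 1.0 / se**2                                 # inverse-variance weights
kappa_pooled = np.sum(w * kappas) / np.sum(w)   # pooled (fixed-effect) estimate

Q = np.sum(w * (kappas - kappa_pooled) ** 2)    # Cochran's Q
df = len(kappas) - 1
I2 = max(0.0, (Q - df) / Q) * 100               # % of variability due to heterogeneity

print(f"pooled kappa = {kappa_pooled:.2f}, Q = {Q:.1f} (df = {df}), I^2 = {I2:.0f}%")
```

A large Q relative to its degrees of freedom (and a high I²) indicates that the area-level kappas are too dissimilar to be summarised by a single “overall” agreement figure, which is the point the heterogeneity analysis makes.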
The paper discusses the practical implications of these findings for the Italian research assessment exercise. By using both IR and bibliometrics in parallel, ANVUR introduced the possibility that the same article could receive divergent grades depending on the method applied, potentially leading to systematic and unrecognised biases in the final allocation of funding, career advancement, and institutional rankings. The authors argue that the claim of “good” agreement is unsupported by the statistical evidence and that the joint use of the two methods in a single assessment framework is methodologically unsound.
In conclusion, the study calls for a complete reassessment of ANVUR’s conclusions. It recommends that future research assessments either rely on a single, well‑validated method or, if a mixed approach is desired, implement rigorous validation procedures, transparent protocols, and independent checks to ensure that the two methods truly converge. Only by adhering to such standards can research evaluation be both fair and reliable, avoiding the introduction of hidden biases that could distort national research policy and the distribution of public resources.