Research quality evaluation by AI in the era of large language models: advantages, disadvantages, and systemic effects – An opinion paper
Artificial Intelligence (AI) technologies like ChatGPT now threaten bibliometrics as the primary generators of research quality indicators. They are already used in at least one research quality evaluation system and evidence suggests that they are used informally by many peer reviewers. Since harnessing bibliometrics to support research evaluation continues to be controversial, this article reviews the corresponding advantages and disadvantages of AI-generated quality scores. From a technical perspective, generative AI based on Large Language Models (LLMs) equals or surpasses bibliometrics in most important dimensions, including accuracy (mostly higher correlations with human scores) and coverage (more fields, more recent years), and may reflect more research quality dimensions. However, like bibliometrics, current LLMs do not “measure” research quality. On the clearly negative side, LLM biases are currently unknown for research evaluation, and LLM scores are less transparent than citation counts. From a systemic perspective, a key issue is how introducing LLM-based indicators into research evaluation will change the behaviour of researchers. Whilst bibliometrics encourage some authors to target journals with high impact factors or to try to write highly cited work, LLM-based indicators may instead push them towards writing misleading abstracts and overselling their work in the hope of impressing the AI. Moreover, if AI-generated journal indicators replace impact factors, this would encourage journals to allow authors to oversell their work in abstracts, threatening the integrity of the academic record.
💡 Research Summary
The opinion paper examines how large language model (LLM)–based artificial intelligence, exemplified by ChatGPT, is poised to reshape research quality evaluation, a domain traditionally dominated by bibliometrics. The authors begin by noting that AI tools are already embedded in at least one formal evaluation system and are informally used by many peer reviewers, signaling a shift from citation‑based metrics toward algorithmic assessment. From a technical standpoint, the paper argues that LLMs match or exceed bibliometric indicators on several key dimensions. First, accuracy: empirical studies cited in the article show higher correlations between LLM‑generated scores and human expert judgments than between citation counts and expert judgments, suggesting that LLMs capture aspects of quality that citations miss. Second, coverage: LLMs can process recent publications, non‑English literature, and emerging fields that are under‑represented in citation databases, thereby expanding the evaluative horizon. Third, dimensionality: because LLMs analyze the full text of papers—abstracts, introductions, methods, and conclusions—they can infer latent qualities such as methodological rigor, novelty, interdisciplinary relevance, and societal impact, which are not directly reflected in citation tallies.
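To make the accuracy comparison concrete, the sketch below computes Spearman rank correlations of LLM-generated scores and citation counts against expert ratings, which is the standard way such agreement is quantified. The data values are invented for illustration and the use of SciPy is an assumption; neither is taken from the paper's own analysis.

```python
# Illustrative comparison of LLM-derived scores and citation counts against
# human expert ratings, using Spearman rank correlation. All data are made up.
from scipy.stats import spearmanr

# Hypothetical per-paper data: expert quality ratings on a 1-4 scale,
# LLM-generated scores on the same scale, and raw citation counts.
expert_scores = [4, 3, 2, 4, 1, 3, 2, 4, 3, 2]
llm_scores    = [4, 3, 2, 3, 1, 3, 2, 4, 4, 2]
citations     = [120, 45, 10, 60, 2, 30, 25, 80, 15, 12]

rho_llm, p_llm = spearmanr(expert_scores, llm_scores)
rho_cit, p_cit = spearmanr(expert_scores, citations)

# A higher rho for the LLM column would mirror the paper's accuracy claim.
print(f"LLM vs expert:       rho={rho_llm:.2f} (p={p_llm:.3f})")
print(f"Citations vs expert: rho={rho_cit:.2f} (p={p_cit:.3f})")
```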
Despite these advantages, the authors caution that LLMs do not “measure” quality in a direct, observable way; they predict human judgments based on patterns learned from training data. Consequently, the sources of bias in LLM‑based scores remain largely unknown. The opacity of model architectures and training corpora makes it difficult to trace why a particular paper receives a high or low score, whereas citation counts are transparent and reproducible. This lack of transparency raises concerns about accountability and fairness.
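As context for how such predicted judgments are obtained in practice, here is a minimal sketch of eliciting a quality score from an LLM via the OpenAI Python SDK. The prompt wording, the 1-4 scale, and the model name are illustrative assumptions, not the paper's actual protocol; the point is that the "score" is a text prediction shaped by the prompt and the model's training data, which is exactly why its biases are hard to audit.

```python
# Minimal sketch of eliciting a 1-4 research quality score from an LLM.
# Prompt wording, scale, and model name are assumptions for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def llm_quality_score(title: str, abstract: str, model: str = "gpt-4o") -> str:
    """Ask the model for a single 1-4 quality rating of a paper."""
    prompt = (
        "You are an expert research assessor. Rate the following article's "
        "research quality on a 1-4 scale (4 = world-leading), considering "
        "rigour, originality, and significance. Reply with the number only.\n\n"
        f"Title: {title}\n\nAbstract: {abstract}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    # The returned digit is a prediction of a human judgment, not a measurement.
    return response.choices[0].message.content.strip()
```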
The paper then shifts to a systemic perspective, exploring how the introduction of LLM‑derived indicators could alter researcher behavior. Bibliometric incentives have already encouraged “impact‑factor chasing” and strategic citation practices. Analogously, LLM incentives might push authors to craft overly sensational abstracts, embed buzzwords, or otherwise “game” the language model to achieve higher scores. Journals, in turn, could feel pressure to relax editorial standards or to allow authors to oversell their contributions, potentially eroding the integrity of the scholarly record.
To mitigate these risks, the authors propose several safeguards: (1) open‑source disclosure of model weights and training data to enable independent bias audits; (2) hybrid evaluation frameworks that combine LLM scores with traditional bibliometrics and human expert review; (3) development of clear ethical guidelines and institutional policies governing AI‑assisted assessment; and (4) ongoing research into the interpretability of LLM outputs, ensuring that stakeholders can understand the basis for any given score.
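A hybrid framework in the spirit of safeguard (2) could, in its simplest form, be a weighted composite of the three signals. The function below is a hypothetical sketch with illustrative weights; the paper does not specify a combination scheme.

```python
# Hypothetical hybrid indicator combining a normalized LLM score, a citation
# percentile, and a human review score. Weights are illustrative only.
def hybrid_score(llm_score: float, citation_percentile: float,
                 human_score: float,
                 weights: tuple[float, float, float] = (0.4, 0.3, 0.3)) -> float:
    """All inputs normalized to [0, 1]; returns a weighted composite."""
    w_llm, w_cit, w_hum = weights
    assert abs(w_llm + w_cit + w_hum - 1.0) < 1e-9, "weights must sum to 1"
    return w_llm * llm_score + w_cit * citation_percentile + w_hum * human_score

# Example: strong LLM assessment, middling citations, strong human reviews.
print(hybrid_score(0.9, 0.5, 0.85))  # -> 0.765
```

Keeping the human-review weight non-zero preserves expert oversight, which is the safeguard's central point; the weights themselves would be a policy decision.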
In conclusion, the paper acknowledges that LLMs hold transformative potential for research quality evaluation—offering higher accuracy, broader coverage, and richer multidimensional insight—but stresses that their deployment must be accompanied by rigorous transparency, bias mitigation, and policy measures to prevent unintended behavioral shifts and to preserve the credibility of academic publishing.