ReviewerGPT? An Exploratory Study on Using Large Language Models for Paper Reviewing
Given the rapid ascent of large language models (LLMs), we study the question: (How) can large language models help in reviewing scientific papers or proposals? We first conduct some pilot studies where we find that (i) GPT-4 outperforms other LLMs (Bard, Vicuna, Koala, Alpaca, LLaMa, Dolly, OpenAssistant, StableLM), and (ii) prompting with a specific question (e.g., to identify errors) outperforms prompting to simply write a review. With these insights, we study the use of LLMs (specifically, GPT-4) for three tasks: 1. Identifying errors: We construct 13 short computer science papers, each with a deliberately inserted error, and ask the LLM to check the correctness of these papers. We observe that the LLM finds errors in 7 of them, spanning both mathematical and conceptual errors. 2. Verifying checklists: We task the LLM with verifying 16 closed-ended checklist questions in the respective sections of 15 NeurIPS 2022 papers. We find that across 119 {checklist question, paper} pairs, the LLM had an 86.6% accuracy. 3. Choosing the "better" paper: We generate 10 pairs of abstracts, deliberately designing each pair in such a way that one abstract is clearly superior to the other. The LLM, however, struggled to discern these relatively straightforward distinctions accurately, committing errors in its evaluations for 6 out of the 10 pairs. Based on these experiments, we think that LLMs have a promising use as reviewing assistants for specific reviewing tasks, but not (yet) for complete evaluations of papers or proposals.
💡 Research Summary
The paper “ReviewerGPT? An Exploratory Study on Using Large Language Models for Paper Reviewing” investigates whether large language models (LLMs), specifically GPT‑4, can assist in the peer‑review process for scientific papers and proposals. The authors begin with a pilot comparison of nine LLMs (GPT‑4, Bard, Vicuna, Koala, Alpaca, LLaMa, Dolly, OpenAssistant, StableLM). GPT‑4 consistently outperforms the other eight, and prompting the model with a concrete task (“identify errors”) yields better results than a generic request to “write a review.” Guided by these findings, the study focuses on three concrete reviewing tasks.
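The pilot finding that a concrete instruction beats a generic "write a review" request can be illustrated with two prompt templates. This is a minimal sketch: the wording of both prompts is an assumption for illustration, not the authors' exact phrasing.

```python
# Two contrasting prompting styles from the pilot study, as illustrative
# templates. The prompt wording here is hypothetical, not taken from the paper.

def generic_review_prompt(paper_text: str) -> str:
    """Generic instruction: ask the model to simply write a review."""
    return f"Write a peer review of the following paper.\n\n{paper_text}"

def targeted_error_prompt(paper_text: str) -> str:
    """Task-specific instruction: ask the model to look for concrete errors."""
    return (
        "Carefully check the following paper for mathematical or conceptual "
        "errors. Quote each suspect passage and explain why it is wrong.\n\n"
        f"{paper_text}"
    )

if __name__ == "__main__":
    sample = "Theorem 1: For all n > 0, 2^n < n^2."
    print(targeted_error_prompt(sample))
```

The study's observation is that the second, narrower instruction elicits more useful reviewer behavior than the first.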
- Error Identification – Thirteen short computer‑science papers were deliberately seeded with either mathematical mistakes (incorrect proofs, faulty derivations) or conceptual flaws (mis‑stated assumptions). GPT‑4 was asked to verify each paper’s correctness. The model successfully detected errors in seven papers, correctly pinpointing both types of mistakes. In the remaining six papers, it missed the inserted errors or produced false positives, indicating that while GPT‑4 can parse context well, its ability to perform deep mathematical verification is limited.
- Checklist Verification – The authors verified sixteen closed‑ended checklist questions (e.g., “Is the dataset publicly available?”, “Is the code released?”, “Has an ethics review been performed?”) against the relevant sections of fifteen NeurIPS 2022 papers. Across the 119 applicable (question, paper) pairs, GPT‑4 achieved an overall accuracy of 86.6%. Accuracy was highest on purely factual items (data/code availability) and lower on subjective items such as “potential societal impact,” suggesting that LLMs excel when the answer is objective but struggle with nuanced, interpretive judgments.
- Choosing the “Better” Paper – Ten pairs of abstracts were crafted so that one abstract was clearly superior in terms of scientific contribution, clarity, and novelty. GPT‑4 was asked to select the better abstract in each pair. The model made incorrect choices in six out of ten cases, often over‑weighting superficial linguistic features (e.g., fluency, vocabulary richness) while neglecting deeper aspects such as methodological rigor or originality. This result highlights a fundamental limitation: current LLMs do not share the human notion of “paper quality” and rely heavily on surface text cues.
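The pairwise-comparison setup above can be sketched as a small scoring harness. In this sketch, the `choose` function stands in for GPT‑4, and the length-based chooser in the demo is a hypothetical caricature of the surface-cue behavior the paper describes, not the model's actual decision rule.

```python
# Toy evaluation loop for the pairwise "better abstract" task. The evaluation
# logic mirrors the paper's setup (pairs with a known better abstract); the
# chooser and the sample data are illustrative assumptions.

def pairwise_score(pairs, choose):
    """Count how often choose(a, b) returns the index of the better abstract.

    Each pair is a tuple (abstract_a, abstract_b, better_index).
    """
    correct = sum(choose(a, b) == better for a, b, better in pairs)
    return correct, len(pairs)

if __name__ == "__main__":
    # Stand-in chooser that naively prefers the longer abstract -- the kind
    # of surface cue the summary suggests GPT-4 over-weights.
    longer = lambda a, b: 0 if len(a) >= len(b) else 1
    pairs = [("short but rigorous", "long and florid but shallow", 0)]
    print(pairwise_score(pairs, longer))  # (0, 1)
```

Under this framing, the paper's result corresponds to a score of 4 correct out of 10 pairs.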
The authors conclude that LLMs, particularly GPT‑4, show promise as reviewing assistants for narrowly defined tasks such as error detection and checklist verification. In these contexts, they can reduce reviewer workload, provide rapid, reproducible feedback, and maintain high accuracy on objective queries. However, the study also demonstrates that LLMs are not yet suitable for full‑paper evaluations that require holistic judgment, assessment of novelty, or ethical considerations.
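The headline checklist figure is easy to reproduce from raw counts: 103 correct answers out of 119 (question, paper) pairs rounds to the reported 86.6%. A minimal sketch follows; the per-pair labels are synthetic placeholders, and only the totals match the paper.

```python
# Minimal scoring harness for closed-ended checklist verification. The yes/no
# labels below are synthetic; only the pair count (119) and the resulting
# accuracy (86.6%) correspond to the figures reported in the paper.

def checklist_accuracy(predictions, gold):
    """Fraction of (checklist question, paper) pairs answered correctly."""
    if len(predictions) != len(gold):
        raise ValueError("prediction/gold length mismatch")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

if __name__ == "__main__":
    # 103 correct out of 119 pairs reproduces the reported accuracy.
    gold = ["yes"] * 119
    preds = ["yes"] * 103 + ["no"] * 16
    acc = checklist_accuracy(preds, gold)
    print(f"{acc:.1%}")  # 86.6%
```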
To move toward more useful integration, the paper proposes three future directions: (1) develop domain‑specific prompting strategies and fine‑tune models on scholarly corpora to improve depth of understanding; (2) design hybrid workflows where LLMs handle repetitive, fact‑based checks while human reviewers focus on high‑level critique; and (3) incorporate meta‑cognitive mechanisms that allow the model to flag uncertainty and defer to humans when confidence is low. By outlining both the capabilities and the current shortcomings, the study provides a realistic roadmap for incorporating LLMs into the peer‑review ecosystem, emphasizing that they should augment—not replace—human expertise at this stage.
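The third proposed direction, confidence-based deferral, could be implemented as a simple triage step. This is a sketch under stated assumptions: the paper proposes the mechanism without specifying an implementation, so the confidence scores and the 0.8 threshold below are illustrative.

```python
# Sketch of the hybrid workflow's deferral step: route a check to a human
# reviewer whenever the model's self-reported confidence falls below a
# threshold. Scores and threshold are illustrative assumptions.

def triage(checks, threshold=0.8):
    """Split (item, confidence) pairs into auto-handled and human-review sets."""
    automated = [item for item, conf in checks if conf >= threshold]
    deferred = [item for item, conf in checks if conf < threshold]
    return automated, deferred

if __name__ == "__main__":
    checks = [("code released?", 0.95),
              ("societal impact discussed?", 0.55)]
    auto, human = triage(checks)
    print(auto)   # ['code released?']
    print(human)  # ['societal impact discussed?']
```

This keeps objective, high-confidence checks with the model while deferring interpretive judgments, matching the division of labor the authors recommend.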