Is Your Paper Being Reviewed by an LLM? Benchmarking AI Text Detection in Peer Review

Notice: This research summary and analysis were automatically generated using AI technology. For accuracy, please refer to the original arXiv source.

Peer review is a critical process for ensuring the integrity of published scientific research. Confidence in this process is predicated on the assumption that experts in the relevant domain give careful consideration to the merits of manuscripts that are submitted for publication. With the recent rapid advancements in large language models (LLMs), a new risk to the peer review process is that negligent reviewers will rely on LLMs to perform the often time-consuming process of reviewing a paper. However, there is a lack of existing resources for benchmarking the detectability of AI text in the domain of peer review. To address this deficiency, we introduce a comprehensive dataset containing a total of 788,984 AI-written peer reviews paired with corresponding human reviews, covering 8 years of papers submitted to each of two leading AI research conferences (ICLR and NeurIPS). We use this new resource to evaluate the ability of 18 existing AI text detection algorithms to distinguish between peer reviews fully written by humans and different state-of-the-art LLMs. Additionally, we explore a context-aware detection method called Anchor, which leverages manuscript content to detect AI-generated reviews, and analyze the sensitivity of detection models to LLM-assisted editing of human-written text. Our work reveals the difficulty of identifying AI-generated text at the individual peer review level, highlighting the urgent need for new tools and methods to detect this unethical use of generative AI. Our dataset is publicly available at: https://huggingface.co/datasets/IntelLabs/AI-Peer-Review-Detection-Benchmark.


💡 Research Summary

The paper addresses the emerging threat that large language models (LLMs) pose to the integrity of scholarly peer review when reviewers covertly use them to generate or edit reviews. To study this problem, the authors construct a massive benchmark dataset comprising 788,984 peer reviews, evenly split between human‑written reviews and AI‑generated reviews. The dataset covers eight years (2016‑2024) of submissions to two leading AI conferences, ICLR and NeurIPS, and pairs each human review with synthetic reviews produced by five state‑of‑the‑art LLMs: GPT‑4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, Qwen 2.5 72B, and Llama 3.1 70B. Human reviews were collected via the OpenReview API and the ASAP dataset, while AI reviews were generated using carefully designed prompts that included conference‑specific reviewer guidelines and decision labels to ensure realistic content. The authors also provide a calibration set (75,824 reviews) and a test set (287,052 reviews) for fair evaluation, plus an extended set of additional GPT‑4o and Llama 3.1 reviews for further analysis.

The core experimental contribution is a systematic benchmark of 18 publicly available AI‑text detection methods, ranging from classic bag‑of‑words logistic regression to modern likelihood‑based, entropy‑based, and zero‑shot detectors such as DetectGPT. Each detector is calibrated on the calibration set to achieve a false‑positive rate (FPR) of ≤1 % before being evaluated on the test set. The results reveal that most existing detectors struggle to reliably distinguish AI‑generated peer reviews, especially those from the newest LLMs. Typical F1 scores hover around 0.55, and area‑under‑curve (AUC) values are often below 0.75, indicating a substantial performance gap when the task is confined to the peer‑review domain.
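The calibration step described above amounts to picking, per detector, the score threshold whose false-positive rate on held-out human reviews stays at or below 1%. The sketch below illustrates that idea on synthetic detector scores; the variable names and score distributions are illustrative, not taken from the paper.

```python
import numpy as np

def calibrate_threshold(human_scores, target_fpr=0.01):
    """Pick the score threshold that flags at most `target_fpr` of
    human-written reviews. Assumes higher scores mean "more likely AI"."""
    # Anything above the (1 - target_fpr) quantile of human scores is
    # flagged, so at most target_fpr of human reviews are misclassified.
    return float(np.quantile(human_scores, 1.0 - target_fpr))

# Toy demonstration with synthetic detector scores.
rng = np.random.default_rng(0)
human = rng.normal(0.0, 1.0, 10_000)   # scores on human-written reviews
ai = rng.normal(2.0, 1.0, 10_000)      # scores on AI-generated reviews

tau = calibrate_threshold(human, target_fpr=0.01)
fpr = (human > tau).mean()             # observed false-positive rate
tpr = (ai > tau).mean()                # detection rate at that threshold
print(f"threshold={tau:.2f}  FPR={fpr:.3f}  TPR={tpr:.3f}")
```

Calibrating at a fixed low FPR reflects the high-stakes setting: falsely accusing a human reviewer of using an LLM is far more costly than missing an AI-written review.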

Motivated by this gap, the authors propose a novel context‑aware detection approach called Anchor. Anchor leverages the unique fact that a peer review is always linked to a specific manuscript. It encodes both the manuscript and the review using a Sentence‑Transformer model, computes cosine similarity between their embeddings, and then applies a Bayesian threshold to decide whether the review is likely AI‑generated. The intuition is that human reviewers produce text that is more semantically aligned with the content of the paper, whereas AI‑generated reviews often exhibit generic phrasing or mismatched details. Anchor dramatically outperforms all 18 baselines on the test set, achieving an F1 of 0.82 and an AUC of 0.94 for GPT‑4o and Claude‑generated reviews, and showing consistent gains across the other LLMs as well.
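The core of Anchor, as described above, is a manuscript-review similarity test. The minimal sketch below substitutes a dependency-free bag-of-words embedding for the Sentence-Transformer model, and the similarity threshold is an illustrative value, not one calibrated in the paper.

```python
import math
from collections import Counter

def embed(text):
    """Stand-in embedding: a bag-of-words count vector. The paper's
    Anchor method uses Sentence-Transformer embeddings instead."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def anchor_flag(manuscript, review, threshold=0.2):
    """Flag a review as likely AI-generated when it is semantically
    distant from the manuscript (threshold chosen for illustration)."""
    return cosine(embed(manuscript), embed(review)) < threshold

paper = "we study sparse attention kernels for long context transformers"
human_review = ("the sparse attention kernels are well motivated "
                "though the long context evaluation is thin")
generic_review = "this work is interesting and well written overall"

print(anchor_flag(paper, human_review))    # False: overlaps with the paper
print(anchor_flag(paper, generic_review))  # True: generic, low overlap
```

The design exploits the asymmetry noted in the summary: a careful human review echoes the manuscript's specific terminology, while a generic AI review often does not.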

The paper also investigates the impact of partial LLM assistance. By progressively replacing portions of human reviews (30 %, 50 %, 70 %) with AI‑generated text, the authors demonstrate that detection sensitivity drops sharply, highlighting that current detectors are primarily tuned for fully synthetic reviews and are vulnerable to mixed‑authorship scenarios.
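One simple way to construct such mixed-authorship test cases is to replace a fixed fraction of a human review's sentences with AI-generated ones. The sketch below follows the 30/50/70% conditions mentioned above; the exact mixing procedure used in the paper may differ.

```python
import random

def mix_review(human_sentences, ai_sentences, frac, seed=0):
    """Replace `frac` of the sentences in a human review with
    AI-generated ones (hypothetical helper, for illustration)."""
    rng = random.Random(seed)
    n = round(frac * len(human_sentences))
    idx = rng.sample(range(len(human_sentences)), n)
    mixed = list(human_sentences)
    for i in idx:
        mixed[i] = ai_sentences[i % len(ai_sentences)]
    return mixed

human = ["The method is novel.", "Experiments are limited.",
         "Writing is clear.", "Baselines are missing.",
         "The ablation helps.", "Claims are overstated.",
         "Figures are readable.", "Related work is thorough.",
         "The proof sketch is fine.", "Reproducibility is unclear."]
ai = ["This paper presents an interesting approach.",
      "The results are promising and well presented."]

for frac in (0.3, 0.5, 0.7):
    mixed = mix_review(human, ai, frac)
    replaced = sum(h != m for h, m in zip(human, mixed))
    print(f"{int(frac * 100)}% condition: "
          f"{replaced} of {len(human)} sentences replaced")
```

Sweeping the replacement fraction lets one plot detector sensitivity as a function of how much of the review is synthetic, which is where the summary reports a sharp drop-off.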

Key contributions are: (1) releasing the largest peer‑review‑focused AI‑text detection benchmark to date; (2) providing a thorough evaluation that exposes the limitations of existing detection tools in this high‑stakes setting; (3) introducing Anchor, a simple yet effective method that exploits manuscript‑review context; (4) offering detailed analyses of linguistic differences between human and AI reviews (e.g., AI reviews tend to be less specific, more favorable, and overly confident); and (5) examining how LLM‑assisted editing affects detectability. All data and code are publicly available on Hugging Face, encouraging further research on safeguarding the peer‑review process against undisclosed generative AI use.

