Allocate Marginal Reviews to Borderline Papers Using LLM Comparative Ranking

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

This paper argues that large ML conferences should allocate marginal review capacity primarily to papers near the acceptance boundary, rather than spreading extra reviews via random or affinity-driven heuristics. We propose using LLM-based comparative ranking (via pairwise comparisons and a Bradley–Terry model) to identify a borderline band *before* human reviewing and to allocate *marginal* reviewer capacity at assignment time. Concretely, given a venue-specific minimum review target (e.g., 3 or 4), we use this signal to decide which papers receive one additional review (e.g., a 4th or 5th), without conditioning on any human reviews and without using LLM outputs for accept/reject. We provide a simple expected-impact calculation in terms of (i) the overlap between the predicted and true borderline sets (ρ) and (ii) the incremental value of an extra review near the boundary (Δ), and we provide retrospective proxies to estimate these quantities.


💡 Research Summary

The paper tackles the problem of how to allocate surplus reviewer capacity in large‑scale machine‑learning conferences. After the minimum required number of reviews per paper (r_min) is satisfied, many venues have a modest excess of review slots (s·N, where s is the average surplus per paper and N is the number of submissions). Current practice typically absorbs this surplus in load‑balancing, affinity‑based matching, or random assignment, which does not target the reviews where they are most valuable. The authors argue that the marginal value of an additional review is highest for papers that sit near the acceptance threshold, because score variance and decision sensitivity are greatest there. Consequently, they propose a policy that deliberately directs the surplus (“marginal”) reviews to a “borderline band” of papers identified before any human reviewer has read the submissions.

The core technical contribution is a lightweight pre‑review triage pipeline based on large‑language‑model (LLM) pairwise comparisons. Each paper is truncated to the first ten pages, and a structured prompt presents two papers (title, abstract, figure/table captions, and truncated text) to the LLM, asking it to choose the better one. The LLM returns a JSON field indicating the winner. By generating many random pairings (e.g., 40 rounds for 1,000 papers, yielding about 20,000 comparisons) the authors collect a binary win‑loss matrix. They fit a Bradley–Terry model to this data via maximum‑likelihood estimation, obtaining latent scores θ_i for each paper. Ordering papers by θ_i yields a global ranking without requiring calibrated absolute scores.
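The Bradley–Terry fitting step described above can be sketched as follows. This is a minimal illustration using the standard MM (minorization–maximization) update, not the authors' implementation; the function name and parameters are assumptions.

```python
import numpy as np

def bradley_terry(wins, iters=500, tol=1e-9):
    """Fit Bradley-Terry latent scores from a pairwise win-count matrix.

    wins[i, j] = number of comparisons in which paper i beat paper j.
    Returns theta (log-strengths); a higher theta means a stronger paper.
    Uses the classic MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j).
    """
    n = wins.shape[0]
    p = np.ones(n) / n
    pair_counts = wins + wins.T          # n_ij: total comparisons of i vs j
    total_wins = wins.sum(axis=1)        # W_i: total wins of paper i
    for _ in range(iters):
        denom = (pair_counts / (p[:, None] + p[None, :])).sum(axis=1)
        new_p = total_wins / np.maximum(denom, 1e-12)
        new_p /= new_p.sum()             # strengths are scale-invariant
        if np.abs(new_p - p).max() < tol:
            p = new_p
            break
        p = new_p
    return np.log(np.maximum(p, 1e-300))
```

Ordering papers by the returned θ values reproduces the global ranking; as the paragraph notes, no calibrated absolute scores are needed, only the relative strengths implied by the win–loss matrix.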

From this ranking the authors define a “borderline band” centered on the expected acceptance percentile (e.g., the top 25 % of papers). The band width w is set to match the total surplus capacity (w = s, expressed as a fraction of N). Papers whose ranking falls within this band are earmarked to receive one extra review (e.g., a fourth or fifth reviewer). The total number of reviews stays constant; only the distribution changes. This allocation can be implemented as a simple post‑matching adjustment by area chairs, fitting seamlessly into existing conference pipelines (e.g., the Toronto Paper Matching System or the two‑phase “2+2” process used at NeurIPS, ICML, and ICLR).
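The band-selection and allocation rule described above can be sketched as a post-matching adjustment. This is an illustrative implementation under stated assumptions; the parameter names (`accept_frac`, `surplus`, `r_min`) are not from the paper.

```python
import numpy as np

def allocate_marginal_reviews(theta, accept_frac=0.25, surplus=0.10, r_min=3):
    """Earmark one extra review for papers in a band centered on the
    expected acceptance boundary.

    theta: Bradley-Terry scores (higher = stronger).
    accept_frac: expected acceptance percentile (e.g., top 25%).
    surplus: s, the surplus review capacity as a fraction of N;
             the band width w is set equal to s.
    Returns an array of per-paper review targets (r_min or r_min + 1).
    """
    n = len(theta)
    order = np.argsort(-theta)              # ranks, best paper first
    cut = int(round(accept_frac * n))       # expected acceptance cut-off rank
    half = int(round(surplus * n / 2))      # half of band width w = s * N
    lo, hi = max(cut - half, 0), min(cut + half, n)
    targets = np.full(n, r_min)
    targets[order[lo:hi]] = r_min + 1       # band papers get one extra review
    return targets
```

Note that the total review count is unchanged relative to a uniform surplus: the band absorbs exactly the s·N extra slots that would otherwise be spread by load-balancing or random assignment.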

To quantify the expected benefit, the authors introduce two parameters. ρ (rho) is the overlap fraction between the LLM‑selected band and the true borderline set (as defined by final decisions and reviewer scores). Δ (delta) is the incremental decision‑reliability gain: the difference in the probability that an extra review flips a decision for a borderline paper versus a non‑borderline paper. Assuming each paper can receive at most one additional review, the expected number of net improved decisions is derived as:

 E[net improved decisions] = ρ · Δ · s · N
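Consistent with the definitions above (s·N redirected reviews, a fraction ρ of which land on truly borderline papers, each yielding an incremental gain Δ over a non-borderline placement), the expected-impact calculation can be sketched numerically. The numbers below are illustrative, not taken from the paper.

```python
def expected_net_improved_decisions(N, s, rho, delta):
    """Expected number of net improved decisions from redirecting the
    surplus: s*N extra reviews, a fraction rho hitting the true
    borderline set, each contributing an incremental gain delta in the
    probability that the extra review corrects the decision."""
    return rho * delta * s * N

# Illustrative (made-up) inputs: 10,000 submissions, 10% surplus,
# 50% band overlap, 2% incremental flip-probability gain.
print(expected_net_improved_decisions(10_000, 0.10, 0.5, 0.02))  # -> 10.0
```

Even with modest overlap and per-review gains, the targeted allocation yields a nonzero expected improvement at no extra reviewing cost, which is the paper's central quantitative claim.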

