OmniReview: A Large-scale Benchmark and LLM-enhanced Framework for Realistic Reviewer Recommendation
Notice: This research summary and analysis were automatically generated using AI technology. For complete accuracy, please refer to the original arXiv source.

Academic peer review remains the cornerstone of scholarly validation, yet the field faces persistent challenges in both data and methods. On the data side, existing research is hindered by the scarcity of large-scale, verified benchmarks and by oversimplified evaluation metrics that fail to reflect real-world editorial workflows. To bridge this gap, we present OmniReview, a comprehensive dataset constructed by integrating multi-source academic platforms into comprehensive scholarly profiles via a disambiguation pipeline, yielding 202,756 verified review records. Building on this data, we introduce a three-tier hierarchical evaluation framework that assesses recommendations from broad recall to precise expert identification. On the method side, existing embedding-based approaches suffer from the information bottleneck of semantic compression and from limited interpretability. To resolve these limitations, we propose Profiling Scholars with Multi-gate Mixture-of-Experts (Pro-MMoE), a novel framework that synergizes Large Language Models (LLMs) with multi-task learning. Specifically, it uses LLM-generated semantic profiles to preserve fine-grained expertise nuances and interpretability, while a task-adaptive MMoE architecture dynamically balances conflicting evaluation goals. Comprehensive experiments demonstrate that Pro-MMoE achieves state-of-the-art performance on six of seven metrics, establishing a new benchmark for realistic reviewer recommendation.


💡 Research Summary

OmniReview tackles two long‑standing bottlenecks in reviewer recommendation: the lack of a large, high‑quality benchmark that mirrors real editorial practice, and the methodological limits of existing embedding‑based models. The authors first construct a massive, verified dataset by merging three authoritative sources—Open Academic Graph (OAG), the Frontiers open‑access platform, and ORCID public data. A rigorous pipeline (data cleaning, cross‑source publication matching, scholar disambiguation, and verification) yields 202,756 review records involving 150,287 distinct reviewers. Unlike prior benchmarks that rely on sparse historical assignments, OmniReview enriches the ground truth by generating dense relevance labels: papers and scholars are organized into a hierarchical taxonomy, semantic matching identifies potential experts beyond the historical pool, and an h‑index filter guarantees a baseline of scholarly quality. This process mitigates false‑negative bias and produces a more realistic candidate set.
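
The h‑index filter mentioned above is a standard, easily reproduced criterion. A minimal sketch is shown below; the `min_h` threshold is a hypothetical value, as the summary does not state the paper's actual cutoff.

```python
def h_index(citations):
    """h-index: the largest h such that the scholar has h papers
    with at least h citations each."""
    counts = sorted(citations, reverse=True)
    h = 0
    for rank, c in enumerate(counts, start=1):
        if c >= rank:
            h = rank  # this paper still supports an h-index of `rank`
        else:
            break
    return h

def passes_quality_filter(citations, min_h=5):
    # min_h is an illustrative threshold, not the paper's actual value
    return h_index(citations) >= min_h
```

For example, a scholar with citation counts [10, 8, 5, 4, 3] has an h‑index of 4, since four papers have at least four citations each.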

To evaluate systems in a way that reflects editorial needs, the authors propose a three‑tier hierarchical framework. Task 1 (Recall) measures the ability to retrieve historically assigned reviewers; Task 2 (Discrimination) tests whether a model can filter out “hard negatives” that share superficial keywords but lack true domain expertise; Task 3 (Ranking) assesses fine‑grained ordering of the most suitable experts among the qualified pool. Each task uses appropriate metrics (recall, precision, NDCG, etc.), allowing a nuanced diagnosis of a system’s strengths and weaknesses.
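
Of the metrics named above, NDCG (used for the Ranking task) is the least self-explanatory; a minimal sketch of its computation follows. This is the standard log2-discounted formulation, not code from the paper.

```python
import math

def dcg(relevances):
    # Discounted cumulative gain: rel_i / log2(i + 1), positions from 1
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(ranked_relevances, k=None):
    """NDCG@k: DCG of the system's ranking divided by the DCG of the
    ideal (descending-relevance) ordering of the same items."""
    rels = ranked_relevances[:k] if k else ranked_relevances
    ideal = sorted(ranked_relevances, reverse=True)
    ideal = ideal[:k] if k else ideal
    idcg = dcg(ideal)
    return dcg(rels) / idcg if idcg > 0 else 0.0
```

A ranking that already lists reviewers in descending relevance scores 1.0; placing the best expert last pushes the score below 1.0, which is exactly the fine-grained ordering behavior Task 3 rewards.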

Methodologically, the paper identifies two core flaws of current embedding‑based approaches: (1) semantic compression that erodes fine‑grained distinctions between sub‑fields, and (2) a black‑box nature that offers little interpretability for editors. To overcome these, the authors introduce Pro‑MMoE (Profiling Scholars with Multi‑gate Mixture‑of‑Experts). The first stage leverages a large language model (LLM) to generate “semantic profiles” for each reviewer. Instead of collapsing a scholar’s publication history into a static vector, the LLM is prompted to extract and summarize research interests into concise, human‑readable text. This preserves nuanced expertise and provides explicit evidence that editors can inspect.
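
The profile-generation step amounts to prompting an LLM with a scholar's publication record. The sketch below only assembles such a prompt; the template wording, the `max_titles` cap, and the function name are all illustrative assumptions, since the paper's actual prompt is not reproduced in this summary.

```python
def build_profile_prompt(name, titles, max_titles=20):
    """Assemble an illustrative LLM prompt asking for a concise,
    human-readable summary of a scholar's research interests.
    The template and truncation limit are hypothetical."""
    bullet_list = "\n".join(f"- {t}" for t in titles[:max_titles])
    return (
        f"Publications by {name}:\n{bullet_list}\n\n"
        "Summarize this scholar's main research areas and methods "
        "in 3-5 sentences suitable for a journal editor."
    )
```

The returned text would then be sent to an LLM, and its response stored as the scholar's semantic profile for downstream matching and for editors to inspect.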

The second stage employs a Task‑Adaptive Multi‑gate Mixture‑of‑Experts architecture. Input features combine the LLM‑generated textual embeddings with traditional metadata (paper abstracts, keywords, citation graphs). Several expert subnetworks learn shared representations, while task‑specific experts focus on the distinct objectives of Recall, Discrimination, and Ranking. Gating networks dynamically allocate weight to each expert per task, thereby balancing the trade‑off between broad candidate coverage and precise expert discrimination within a unified model. The overall loss is a weighted sum of task‑specific losses, allowing flexible emphasis during training.
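
The multi-gate mechanism described above can be sketched in a few lines: shared experts transform the input, and each task owns a softmax gate that weights the experts' outputs. This toy forward pass uses scalar expert outputs and random linear weights purely for illustration; the paper's architecture, dimensions, and loss weights are not specified here.

```python
import math
import random

random.seed(0)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

class MMoE:
    """Minimal multi-gate mixture-of-experts forward pass: shared
    experts, one softmax gate per task. Illustrative only."""

    def __init__(self, in_dim, n_experts, tasks):
        self.experts = [[random.gauss(0, 0.1) for _ in range(in_dim)]
                        for _ in range(n_experts)]
        self.gates = {t: [[random.gauss(0, 0.1) for _ in range(in_dim)]
                          for _ in range(n_experts)] for t in tasks}

    def forward(self, x):
        # Each (linear, scalar-output) expert sees the same input
        expert_out = [sum(w * xi for w, xi in zip(e, x)) for e in self.experts]
        outputs = {}
        for task, gate in self.gates.items():
            logits = [sum(w * xi for w, xi in zip(g, x)) for g in gate]
            probs = softmax(logits)  # per-task weighting over experts
            outputs[task] = sum(p * o for p, o in zip(probs, expert_out))
        return outputs

def total_loss(task_losses, weights):
    # Weighted sum of per-task losses; the weights are hyperparameters
    return sum(weights[t] * l for t, l in task_losses.items())
```

Because each task's gate produces its own distribution over the shared experts, Recall can lean on broad-coverage experts while Discrimination emphasizes fine-grained ones, all within one model.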

Extensive experiments on the OmniReview benchmark compare Pro‑MMoE against strong baselines: TF‑IDF, Doc2Vec, GraphSAGE, single‑objective MMoE, and recent LLM‑enhanced recommenders. Pro‑MMoE outperforms all baselines on six of seven metrics, achieving absolute gains of 1.02 % (Recall), 5.39 % (Discrimination), and a striking 17.15 % (Ranking). The biggest improvement appears in the Discrimination task, confirming that LLM‑derived profiles effectively separate true experts from superficially similar but unsuitable candidates. Moreover, the textual profiles afford interpretability: editors receive a natural‑language justification such as “Reviewer X has recent publications on Y and Z, matching 85 % of the submission’s keywords.”

The authors acknowledge limitations: LLM inference incurs computational cost and may introduce its own biases; the weighting of task losses remains heuristic; and real‑time deployment would require caching or more efficient LLM variants. Future work is outlined to explore cost‑effective LLM fine‑tuning, automated validation of generated profiles, meta‑learning for dynamic loss weighting, and extensions to other expert‑matching domains (e.g., patent examination, legal peer review).

In summary, OmniReview delivers the most comprehensive, verified reviewer‑recommendation dataset to date and proposes a novel LLM‑MMoE hybrid model that simultaneously enhances representation granularity, interpretability, and multi‑objective optimization. This work sets a new standard for realistic reviewer recommendation research and provides a solid foundation for subsequent advances in scholarly workflow automation.
