Hybrid Pooling with LLMs via Relevance Context Learning

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

High-quality relevance judgements over large query sets are essential for evaluating Information Retrieval (IR) systems, yet manual annotation remains costly and time-consuming. Large Language Models (LLMs) have recently shown promise as automatic relevance assessors, but their reliability is still limited. Most existing approaches rely on zero-shot prompting or In-Context Learning (ICL) with a small number of labeled examples. However, standard ICL treats examples as independent instances and fails to explicitly capture the underlying relevance criteria of a topic, restricting its ability to generalize to unseen query-document pairs. To address this limitation, we introduce Relevance Context Learning (RCL), a novel framework that leverages human relevance judgements to explicitly model topic-specific relevance criteria. Rather than directly using labeled examples for in-context prediction, RCL first prompts an LLM (Instructor LLM) to analyze sets of judged query-document pairs and generate explicit narratives that describe what constitutes relevance for a given topic. These relevance narratives are then used as structured prompts to guide a second LLM (Assessor LLM) in producing relevance judgements. To evaluate RCL in a realistic data collection setting, we propose a hybrid pooling strategy in which a shallow depth-k pool from participating systems is judged by human assessors, while the remaining documents are labeled by LLMs. Experimental results demonstrate that RCL substantially outperforms zero-shot prompting and consistently improves over standard ICL. Overall, our findings indicate that transforming relevance examples into explicit, context-aware relevance narratives is a more effective way of exploiting human judgements for LLM-based IR dataset construction.


💡 Research Summary

The paper addresses the costly and time‑consuming nature of building large‑scale relevance judgments (qrels) for information retrieval (IR) evaluation. While recent work has shown that large language models (LLMs) can serve as automatic relevance assessors, existing approaches—zero‑shot prompting or standard in‑context learning (ICL) with a few examples—suffer from limited generalization because they treat each example as an isolated query‑document pair and do not capture the underlying relevance criteria of a topic.

To overcome these limitations, the authors propose Relevance Context Learning (RCL), a two‑stage framework that explicitly models topic‑specific relevance criteria. First, a small set of human‑judged query‑document pairs (obtained from a shallow depth‑k pool) is fed to an “Instructor LLM”. This model analyzes the examples and generates a concise, natural‑language “relevance narrative” that describes what makes a document relevant for the given topic (e.g., required content, specificity, freshness). The narrative serves as a meta‑instruction rather than a collection of raw examples.

Second, the generated narrative is incorporated into a structured prompt for an “Assessor LLM”, which then judges the remaining documents in the pool. By conditioning the assessor on the narrative, the model can apply the same relevance criteria across many unseen documents, improving consistency and interpretability.
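The two-stage flow can be sketched as a pair of prompt builders. This is an illustrative reconstruction, not the paper's actual templates: the function names, prompt wording, and binary 0/1 label scheme are our assumptions.

```python
# Hypothetical sketch of RCL's two-stage prompting (our own template wording,
# not the paper's): the Instructor distills a narrative, the Assessor uses it.

def build_instructor_prompt(topic: str, examples: list[tuple[str, int]]) -> str:
    """Ask the Instructor LLM to distill a relevance narrative from judged pairs."""
    lines = [
        f"Topic: {topic}",
        "Below are documents with human relevance labels (1 = relevant, 0 = not).",
        "Write a short narrative describing what makes a document relevant to this topic.",
    ]
    for i, (doc, label) in enumerate(examples, 1):
        lines.append(f"[{i}] label={label}: {doc}")
    return "\n".join(lines)

def build_assessor_prompt(topic: str, narrative: str, document: str) -> str:
    """Condition the Assessor LLM on the narrative instead of raw examples."""
    return "\n".join([
        f"Topic: {topic}",
        f"Relevance criteria: {narrative}",
        "Judge the following document. Answer 1 if it meets the criteria, else 0.",
        f"Document: {document}",
    ])
```

The key design point mirrored here is that the Assessor prompt carries only the distilled criteria, so its length does not grow with the number of human-judged examples.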

The authors embed RCL within a hybrid pooling strategy that mirrors realistic test‑collection construction. For each topic, they collect the top‑k documents from all participating systems. Human assessors label only the top‑k_human documents (e.g., the top‑3), providing high‑quality judgments for the most competitive results and supplying the data needed for the Instructor LLM. The rest of the pool (ranks > k_human) is automatically labeled by the Assessor LLM using the relevance narrative. This division reduces human effort dramatically while preserving critical human oversight at shallow ranks.
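The pool partition described above can be sketched in a few lines. This is a minimal illustration under our own assumptions (parameter names `k` and `k_human`, set-based deduplication across runs); the paper's exact pooling procedure may differ in detail.

```python
def hybrid_pool(runs: dict[str, list[str]], k: int, k_human: int) -> tuple[set, set]:
    """Split a depth-k pool into a human-judged shallow pool and an LLM pool.

    `runs` maps a system name to its ranked list of doc ids (best first).
    Documents appearing at rank <= k_human in any run go to human assessors;
    the remainder of the depth-k pool goes to the Assessor LLM.
    """
    human, llm = set(), set()
    for ranking in runs.values():
        human.update(ranking[:k_human])
        llm.update(ranking[k_human:k])
    llm -= human  # a doc already judged by humans never needs an LLM label
    return human, llm

# Example: two systems, depth-4 pool, humans judge only the top 2 of each run.
runs = {"sysA": ["d1", "d2", "d3", "d4", "d5"],
        "sysB": ["d2", "d6", "d1", "d7", "d8"]}
human_pool, llm_pool = hybrid_pool(runs, k=4, k_human=2)
```

Note how `d1`, ranked third by `sysB` but first by `sysA`, lands in the human pool: any shallow appearance suffices, which concentrates human effort on the most competitive documents.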

Experiments are conducted on three standard IR test collections: TREC Deep Learning 2019, TREC Deep Learning 2020, and TREC‑8. The LLM used for both instructor and assessor roles is Llama‑3.1‑8B‑Instruct, run through the high‑performance vLLM inference engine. Evaluation metrics include AP@1000, per‑query F1, and Matthews Correlation Coefficient (MCC). Baselines comprise zero‑shot prompting, random ICL (examples drawn without regard to relevance), and ICL limited to relevant examples only.
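For binary relevance labels, the F1 and MCC metrics used in the evaluation reduce to counts from the confusion matrix. The sketch below is a standard textbook implementation, not code from the paper:

```python
def f1_and_mcc(gold: list[int], pred: list[int]) -> tuple[float, float]:
    """Compute F1 and Matthews Correlation Coefficient for binary labels."""
    tp = sum(g and p for g, p in zip(gold, pred))          # true positives
    fp = sum((not g) and p for g, p in zip(gold, pred))    # false positives
    fn = sum(g and (not p) for g, p in zip(gold, pred))    # false negatives
    tn = sum((not g) and (not p) for g, p in zip(gold, pred))  # true negatives
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    denom = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return f1, mcc
```

Unlike F1, MCC rewards correct negatives as well, which matters for qrels where most pooled documents are non-relevant; this is why the paper reports both.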

Results show that hybrid depth‑k pooling consistently outperforms a stratified‑sampling baseline across all collections. For DL‑19, depth‑k achieves an F1 of 0.891 versus 0.766 for stratified sampling with ICL‑Relevant (3 shots), a 16 % relative gain; similar improvements are observed for DL‑20 (≈12 %) and TREC‑8 (≈22 %). MCC improvements are also substantial (e.g., 19 % on DL‑19), indicating better discriminative power. Notably, on collections with long documents (TREC‑8), narrative‑based prompting avoids context‑window overflow that hampers example‑based ICL, leading to markedly higher performance.
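The relative gains quoted above follow from the usual ratio of the improvement to the baseline score, e.g. for the DL-19 F1 figures:

```python
def relative_gain(new: float, base: float) -> float:
    """Relative improvement of `new` over `base`."""
    return (new - base) / base

# DL-19: F1 of 0.891 (hybrid depth-k) vs 0.766 (stratified sampling, ICL-Relevant)
gain_pct = round(relative_gain(0.891, 0.766) * 100)  # → 16
```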

Ablation experiments mixing narratives with example demonstrations reveal that pure narratives are more efficient than hybrid prompts, confirming that an explicit description of relevance criteria is more beneficial than supplying multiple raw examples. Cost analysis indicates that limiting human labeling to the shallow pool reduces total annotation expense by over 70 % while maintaining or improving label quality.

The paper acknowledges limitations: the quality of the generated relevance narrative depends on the diversity and representativeness of the human‑judged seed set; highly complex or multi‑intent topics may yield overly generic narratives that miss nuance. Future work is suggested to explore multi‑narrative aggregation, human‑in‑the‑loop verification of narratives, and extending RCL to multilingual or domain‑specific collections.

In summary, Relevance Context Learning offers a practical, cost‑effective pathway to construct high‑quality IR test collections by transforming a few human judgments into reusable, topic‑specific relevance narratives that guide LLMs in large‑scale automatic labeling. This approach bridges the gap between fully manual assessment and black‑box LLM predictions, delivering both scalability and interpretability for modern IR evaluation pipelines.

