Training a Utility-based Retriever Through Shared Context Attribution for Retrieval-Augmented Language Models
Retrieval-Augmented Language Models boost task performance, owing to the retriever that provides external knowledge. Although crucial, the retriever primarily focuses on semantic relevance, which may not always be effective for generation. Utility-based retrieval has therefore emerged as a promising direction, prioritizing passages that provide concrete benefits for downstream tasks. However, passage utility is still poorly understood, and capturing it accurately remains an open problem. This work proposes SCARLet, a framework for training utility-based retrievers in RALMs that incorporates two key factors: multi-task generalization and inter-passage interaction. First, SCARLet constructs a shared context on which training data for various tasks is synthesized. This mitigates the semantic bias that arises from context differences, allowing retrievers to focus on learning task-specific utility and to generalize across tasks. Next, SCARLet uses a perturbation-based attribution method to estimate passage-level utility within the shared context; this reflects interactions between passages and provides more accurate feedback. We evaluate our approach on ten datasets across various tasks, both in-domain and out-of-domain, showing that retrievers trained with SCARLet consistently improve the overall performance of RALMs.
💡 Research Summary
The paper introduces SCARLet, a novel framework for training utility‑oriented retrievers in Retrieval‑Augmented Language Models (RALMs). Traditional RALMs rely on retrievers that prioritize semantic relevance, which often misaligns with the downstream generation task’s actual needs. SCARLet tackles two overlooked challenges: (1) multi‑task generalization and (2) inter‑passage interaction.
First, SCARLet constructs a shared context that is common across all downstream tasks. Starting from seed data for multiple tasks, it extracts entities, expands them via Wikidata, and retrieves related Wikipedia passages. These passages form a single shared context D_shared. Using a large‑language‑model synthesizer, SCARLet then generates synthetic (input, ground‑truth) pairs for each task, conditioning on D_shared and the task’s instructions and examples. This eliminates the semantic bias introduced by the conventional pooling strategy, in which each training instance carries its own context, and allows the retriever to focus on learning task‑specific utility rather than mere relevance. Noise passages are deliberately inserted to improve robustness, and a filtering step removes faulty samples.
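The construction pipeline can be sketched end to end as below. All function names and the tiny corpus are illustrative stand‑ins, not the paper’s code: real entity extraction, Wikidata expansion, and LLM synthesis are replaced with toy placeholders so the flow stays self‑contained.

```python
def extract_entities(seed_examples):
    """Toy stand-in for entity extraction from seed task data."""
    return {e for ex in seed_examples for e in ex["entities"]}

def expand_via_kb(entities):
    """Toy stand-in for expanding entities through a knowledge base (Wikidata)."""
    related = {"Marie Curie": ["Pierre Curie", "Radium"]}
    return entities | {r for e in entities for r in related.get(e, [])}

def retrieve_passages(entities, corpus):
    """Collect passages mentioning any entity; the result plays the role of D_shared."""
    return [p for p in corpus if any(e in p for e in entities)]

def synthesize_pairs(d_shared, task_instruction):
    """Toy stand-in for LLM-based (input, ground-truth) synthesis per task."""
    return [(f"{task_instruction} about: {p[:30]}...", "<gold answer>") for p in d_shared]

corpus = ["Marie Curie discovered polonium.", "Radium glows faintly.", "Unrelated noise passage."]
seeds = [{"entities": ["Marie Curie"]}]
d_shared = retrieve_passages(expand_via_kb(extract_entities(seeds)), corpus)
qa_pairs = synthesize_pairs(d_shared, "Answer a question")
```

Because D_shared is built once and reused by every task’s synthesizer call, the synthesized instances for different tasks share the same passage pool, which is what removes the per-instance context bias described above.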
Second, SCARLet estimates passage‑level utility through a perturbation‑based attribution method. For a context containing k passages, a binary perturbation vector v ∈ {0,1}^k indicates which passages are kept or removed. Instead of enumerating all 2^k possibilities, SCARLet samples n random perturbations and measures the resulting change in the generator’s logits for the ground‑truth tokens (logit fluctuation). It then fits a ridge‑regression surrogate model:
α̂₀, α̂ = arg min_{α₀, α} Σ_{i=1}^{n} (z_i − α₀ − αᵀ v_i)² + λ‖α‖²
where α_i represents the utility score of passage i and α_0 captures the intercept (overall context effect). The outcome z_i is the fluctuation in the summed token‑wise logits of the ground‑truth answer under perturbation i. This surrogate captures not only the individual contribution of each passage but also the synergy among passages, because the perturbations reflect combinations of kept and removed passages. Experiments on the GTI benchmark (HotpotQA, Natural Questions, MSMARCO‑QA) show that the method achieves nDCG scores above 80 % (often above 90 %), outperforming alternative attribution techniques by more than 20 %.
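The surrogate fit can be illustrated with a small, self‑contained sketch. The generator’s logit fluctuations are simulated here with a hypothetical linear model plus noise (the `true_alpha` utilities and all function names are invented for illustration, not taken from the paper):

```python
import numpy as np

def attribute_passages(z, V, lam=1.0):
    """Fit the ridge-regression surrogate from n perturbation samples.

    V : (n, k) binary matrix; V[i, j] = 1 if passage j was kept in sample i.
    z : (n,) measured logit fluctuations.
    Returns (alpha0, alpha): intercept and per-passage utility scores.
    """
    n, k = V.shape
    X = np.hstack([np.ones((n, 1)), V])   # prepend an intercept column for alpha_0
    penalty = lam * np.eye(k + 1)
    penalty[0, 0] = 0.0                   # do not shrink the intercept
    beta = np.linalg.solve(X.T @ X + penalty, X.T @ z)
    return beta[0], beta[1:]

# Toy demonstration: 4 passages with assumed "true" utilities, noisy observations.
rng = np.random.default_rng(0)
true_alpha = np.array([2.0, 0.0, -1.0, 0.5])
V = rng.integers(0, 2, size=(64, 4)).astype(float)
z = 0.3 + V @ true_alpha + rng.normal(0.0, 0.05, 64)
alpha0, alpha = attribute_passages(z, V, lam=0.1)
```

With far fewer than 2^k samples, the recovered `alpha` already reproduces the ordering of the true utilities, which is all the downstream positive/negative sampling step needs.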
Third, SCARLet samples positive and negative passages based on the derived utility scores. The distribution of scores follows an inverse S‑shape; high‑scoring passages are treated as positive examples, low‑scoring ones as negative, and middle‑scoring passages are discarded. A one‑dimensional clustering (e.g., k‑means on the score axis) automatically determines the three clusters, adapting to varying numbers of useful passages across tasks. Using these labeled passages, the retriever (a dense encoder) is trained with a contrastive loss:
L = Σ_x Σ_{d⁺∈D⁺} Σ_{d⁻∈D⁻} max(0, margin – score(x,d⁺) + score(x,d⁻)) + λ·CE
where the cross‑entropy term regularizes the model.
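The score-based sampling step can be sketched with a plain 1‑D k‑means over the utility scores; the three recovered clusters map to negatives, discarded middles, and positives. The scores and helper names below are illustrative, not from the paper:

```python
def kmeans_1d(scores, iters=50):
    """Cluster scalar utility scores into low / mid / high with 1-D k-means."""
    centers = [min(scores), sum(scores) / len(scores), max(scores)]
    for _ in range(iters):
        groups = [[] for _ in centers]
        for s in scores:
            groups[min(range(3), key=lambda j: abs(s - centers[j]))].append(s)
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    labels = [min(range(3), key=lambda j: abs(s - centers[j])) for s in scores]
    order = sorted(range(3), key=lambda j: centers[j])   # relabel so 0 < 1 < 2
    rank = {c: r for r, c in enumerate(order)}
    return [rank[l] for l in labels]   # 0 = negative, 1 = discard, 2 = positive

# Hypothetical utility scores for seven passages in one shared context.
scores = [0.95, 0.90, 0.40, 0.05, 0.02, 0.88, 0.10]
labels = kmeans_1d(scores)
positives = [i for i, l in enumerate(labels) if l == 2]
negatives = [i for i, l in enumerate(labels) if l == 0]
```

The resulting positive and negative passage indices would then feed the margin-based contrastive loss above, with each (x, d⁺, d⁻) triple contributing one hinge term.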
The authors evaluate SCARLet on ten datasets covering eight distinct tasks (multi‑hop QA, long‑form QA, fact‑checking, dialogue, etc.), both in‑domain and out‑of‑domain. Across all benchmarks, RALMs equipped with SCARLet‑trained retrievers consistently outperform baselines that use semantic‑based dense retrievers or earlier utility‑based methods. Improvements are observed in standard metrics such as Exact Match, F1, BLEU, and ROUGE, with particularly strong gains on out‑of‑domain data, demonstrating better generalization. Ablation studies confirm that (a) shared‑context synthesis is crucial for multi‑task transfer, and (b) the perturbation‑based attribution yields more accurate utility signals than methods that ignore inter‑passage effects. Case studies illustrate that SCARLet correctly assigns high utility to passages containing essential facts while penalizing irrelevant or misleading passages.
In summary, SCARLet provides a complete pipeline: (1) shared‑context creation to neutralize task‑specific semantic bias, (2) perturbation‑surrogate attribution to capture fine‑grained, interaction‑aware utility, and (3) contrastive retriever training using utility‑derived positive/negative samples. This framework bridges the gap between retriever and generator, enabling RALMs to retrieve information that truly benefits downstream generation, and offers a plug‑and‑play improvement applicable to a wide range of language‑model‑driven applications.