Pointwise Mutual Information as a Performance Gauge for Retrieval-Augmented Generation
Recent work suggests that large language models enhanced with retrieval-augmented generation are easily influenced by the order in which retrieved documents are presented when solving tasks such as question answering (QA). However, no method to date exploits this phenomenon to improve generation. We fill this gap. In this study, we show that the pointwise mutual information between a context and a question is an effective gauge of language model performance. Importantly, this gauge does not require knowing the answer to the question a priori. Through experiments on two question-answering datasets and a variety of large language models, we find evidence of an empirical correlation between answer accuracy and pointwise mutual information. We further propose two methods that use the pointwise mutual information between a document and a question as a gauge for selecting and constructing prompts, and demonstrate experimentally that they lead to better performance.
💡 Research Summary
The paper investigates how pointwise mutual information (PMI) between a question and a retrieved context can serve as a performance gauge for retrieval‑augmented generation (RAG) with large language models (LLMs). Recent work has shown that the order of retrieved documents strongly influences QA accuracy, but exploiting this effect required prior knowledge of the correct answer. The authors propose using PMI = log p(q | c) − log p(q) as a proxy that does not need the answer.
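The PMI quantity above can be sketched as a small helper that takes a log-probability scorer. Note this is a minimal illustration, not the paper's implementation: `log_p` is a hypothetical function (with a real LLM it would return the summed token log-likelihoods of the question given an optional prefix), and the toy scorer exists only so the sketch runs end to end.

```python
from typing import Callable

def pmi(question: str, context: str,
        log_p: Callable[[str, str], float]) -> float:
    """PMI(q; c) = log p(q | c) - log p(q).

    `log_p(text, prefix)` is a hypothetical scorer: the model's
    log-probability of `text` given `prefix` (empty prefix for the
    unconditional term). With a real LLM this would sum token
    log-likelihoods of the question under the given context.
    """
    return log_p(question, context) - log_p(question, "")

# Toy stand-in scorer (NOT a language model): more word overlap
# between question and prefix -> higher log-probability.
def toy_log_p(text: str, prefix: str) -> float:
    overlap = len(set(text.split()) & set(prefix.split()))
    return -len(text.split()) + 0.5 * overlap

score = pmi("who wrote hamlet",
            "hamlet was written by shakespeare", toy_log_p)
```

Under this toy scorer a context sharing vocabulary with the question yields a positive PMI, mirroring the intuition that an informative context raises the likelihood of the question.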
Theoretical contribution: Under the assumption that (i) when the model generates the correct answer the question is conditionally independent of the context given the answer, and (ii) when the model generates an incorrect answer the question is independent of the context, they prove that PMI equals the log‑odds of the correct answer up to an additive constant (Proposition 2.1). Hence higher PMI should correlate with higher probability of producing the correct answer.
Empirical setup: Experiments are conducted on two QA benchmarks—NQ‑Open (short factual answers) and ELI5 (long‑form answers). The authors evaluate several open‑source LLM families: LLaMA‑2, LLaMA‑3, LLaMA‑3.1, Mistral‑v0.3, and MPT. For each question they retrieve a set of K documents, permute them, and compute PMI for each permutation. Because enumerating all K! permutations is infeasible, they restrict to the cyclic group generated by a user‑specified permutation, yielding K representative orderings.
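The restriction to K orderings can be illustrated with the simplest cyclic subgroup, the one generated by a single rotation (the paper allows a user-specified generating permutation; this sketch assumes plain rotation):

```python
def cyclic_orderings(docs: list) -> list:
    """Return the K rotations of `docs` -- one representative
    ordering per element of the cyclic group generated by a
    single left rotation, instead of all K! permutations."""
    k = len(docs)
    return [docs[i:] + docs[:i] for i in range(k)]

orderings = cyclic_orderings(["d1", "d2", "d3"])
```

Each retrieved document thus appears in every position exactly once across the K candidate contexts, which is what makes the position-sensitivity analysis tractable.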
Results:
- Corpus‑level analysis shows a clear monotonic relationship: splitting instances into low, medium, and high PMI tertiles reveals that higher PMI bins achieve substantially higher accuracy (NQ‑Open) or sub‑claim recall (ELI5).
- Instance‑level analysis reproduces the U‑shaped curve reported by Liu et al. (2024): when the gold document is placed at the beginning or end of the context, both PMI and accuracy peak; placing it in the middle depresses both.
- Table 1 confirms that the permutation with maximal PMI yields the best average accuracy, while the minimal‑PMI permutation yields the worst.
Based on these observations, two prompt‑optimization methods are introduced:
- PMI‑max permutation – compute PMI for each candidate ordering and select the one with the highest value. This directly exploits the identified correlation.
- U‑shape efficient reordering – without exhaustive scoring, place the gold (or most relevant) document at the start or end and keep the remaining order unchanged. This cheap heuristic approximates the PMI‑max solution and consistently improves performance.
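The PMI-max method above can be sketched as follows. As before, `log_p` is a hypothetical scorer standing in for LLM token log-likelihoods, and the names and prompt joining are illustrative rather than the paper's exact procedure:

```python
from typing import Callable, List

def select_pmi_max(question: str,
                   orderings: List[List[str]],
                   log_p: Callable[[str, str], float]) -> List[str]:
    """Score each candidate document ordering by
    PMI(q; c) = log p(q | c) - log p(q) and return the best one.

    `log_p(text, prefix)` is a hypothetical scorer; with a real
    LLM it would sum token log-likelihoods of the question."""
    log_p_q = log_p(question, "")  # unconditional term, shared by all
    def score(docs: List[str]) -> float:
        return log_p(question, "\n".join(docs)) - log_p_q
    return max(orderings, key=score)
```

Because the unconditional term log p(q) is constant across orderings, ranking by PMI is equivalent to ranking by the conditional likelihood log p(q | c) alone; the sketch keeps the subtraction to match the definition.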
Both methods produce measurable gains across all models, with instruction‑tuned variants (e.g., LLaMA‑3‑Instruct, LLaMA‑3.1‑Instruct) benefiting the most (2–5 percentage‑point improvements).
The paper’s contributions are threefold: (1) establishing PMI as a reliable, answer‑agnostic gauge of RAG performance; (2) providing empirical evidence that maximizing PMI improves QA accuracy; (3) offering two practical, low‑overhead algorithms for prompt construction that can be readily integrated into existing retrieval‑augmented pipelines. The work opens avenues for automated prompt engineering and document reordering based on probabilistic information‑theoretic metrics, potentially extending to other downstream tasks beyond QA.