DIVERGE: Diversity-Enhanced RAG for Open-Ended Information Seeking

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Existing retrieval-augmented generation (RAG) systems are primarily designed under the assumption that each query has a single correct answer. This overlooks common information-seeking scenarios with multiple plausible answers, where diversity is essential: collapsing to a single dominant response constrains creativity and compromises fair and inclusive information access. Our analysis reveals a commonly overlooked limitation of standard RAG systems: they underutilize retrieved context diversity, such that increasing retrieval diversity alone does not yield diverse generations. To address this limitation, we propose DIVERGE, a plug-and-play agentic RAG framework with novel reflection-guided generation and memory-augmented iterative refinement, which promotes diverse viewpoints while preserving answer quality. We introduce novel metrics tailored to evaluating the diversity-quality trade-off in open-ended questions, and show that they correlate well with human judgments. We demonstrate that DIVERGE achieves the best diversity-quality trade-off compared to competitive baselines and previous state-of-the-art methods on the real-world Infinity-Chat dataset, substantially improving diversity while maintaining quality. More broadly, our results reveal a systematic limitation of current LLM-based systems for open-ended information seeking and show that explicitly modeling diversity can mitigate it. Our code is available at: https://github.com/au-clan/Diverge


💡 Research Summary

The paper “DIVERGE: Diversity‑Enhanced Retrieval‑Augmented Generation for Open‑Ended Information Seeking” identifies a fundamental shortcoming of current Retrieval‑Augmented Generation (RAG) pipelines: they are built on the assumption that each user query has a single correct answer. While this assumption works well for fact‑checking or closed‑domain QA, many real‑world information‑seeking scenarios are open‑ended, admitting multiple plausible viewpoints shaped by cultural background, personal values, or contextual nuance. In such settings, diversity of the generated response is not a luxury but a necessity for fairness, creativity, and inclusive knowledge access.

Through a systematic analysis, the authors expose three intertwined challenges that prevent standard RAG systems from delivering diverse outputs: (C1) Single‑Answer Bias – the generation component, typically a large language model (LLM), is over‑confident and collapses to a narrow answer despite receiving a set of heterogeneous retrieved documents; (C2) Missing Diversity Preservation – there is no mechanism to retain or recall previously generated viewpoints across multiple generations, leading to highly redundant outputs; (C3) Limited Practical Applicability – most existing diversity‑enhancing techniques rely on token‑level logits or decoding hyper‑parameters (temperature, top‑p, etc.), which are unavailable in closed‑source frontier models such as GPT‑5 or Claude‑3, making them unusable in production.

To overcome these obstacles, the authors propose DIVERGE, a plug‑and‑play, agentic RAG framework that explicitly models and preserves multiple viewpoints while remaining compatible with any LLM API. DIVERGE consists of four tightly coupled modules:

  1. Reflection‑Guided Viewpoint Generation – after an initial retrieval pass, the LLM receives a meta‑prompt encouraging it to “reflect on uncovered perspectives” and generate a set of distinct viewpoints. This leverages recent findings that LLMs contain latent reasoning trajectories that can be activated by appropriate prompting.

  2. Viewpoint‑Conditioned Retrieval – each newly generated viewpoint is used to reformulate the query, triggering a second retrieval round that fetches evidence specifically supporting that perspective.

  3. Lightweight Memory – viewpoints and their associated evidence are stored in a compact key‑value memory. In subsequent iterations the memory is consulted, ensuring that earlier ideas are not forgotten and that the system can build upon them rather than restarting from scratch.

  4. Iterative Reflection‑Based Refinement – the system alternates between evidence‑grounded generation and a reflective check (“Are there still missing viewpoints?”). The final answer is assembled from the refined set of viewpoints, guaranteeing both breadth (diversity) and depth (quality).

Crucially, DIVERGE does not require access to token‑level probabilities; all operations are performed via standard prompt‑completion calls, making the approach viable for closed‑source, commercial LLMs.
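The four modules above form one iterative loop: generate missing viewpoints, retrieve evidence per viewpoint, store both in memory, and stop when reflection finds nothing new. The sketch below illustrates that control flow only; `propose_viewpoints`, `retrieve`, and `compose` are illustrative stubs standing in for prompt-completion calls and a retriever, and the names, queries, and stopping rule are assumptions, not the authors' implementation.

```python
# Minimal sketch of a DIVERGE-style agentic loop (stubs, not the paper's code).

def propose_viewpoints(question: str, covered: dict) -> list[str]:
    """Stub for reflection-guided generation: surface one uncovered perspective."""
    candidates = ["environmental", "economic", "cultural"]
    missing = [v for v in candidates if v not in covered]
    return missing[:1]  # reflect one new viewpoint per round

def retrieve(query: str, k: int = 3) -> list[str]:
    """Stub retriever: return placeholder evidence passages."""
    return [f"passage-{i} for '{query}'" for i in range(k)]

def compose(viewpoint: str, docs: list[str]) -> str:
    """Stub for evidence-grounded generation under one viewpoint."""
    return f"{viewpoint} perspective, grounded in {len(docs)} passages"

def diverge(question: str, max_iters: int = 5) -> dict[str, str]:
    memory: dict[str, list[str]] = {}  # lightweight memory: viewpoint -> evidence
    for _ in range(max_iters):
        new_views = propose_viewpoints(question, memory)  # (1) reflection-guided generation
        if not new_views:  # (4) reflective check: no missing viewpoints remain
            break
        for view in new_views:
            # (2) viewpoint-conditioned retrieval: reformulate the query per viewpoint
            memory[view] = retrieve(f"{question} ({view} perspective)")
        # (3) memory persists across iterations, so earlier viewpoints are retained
    # assemble the final answer from the refined viewpoint set
    return {v: compose(v, docs) for v, docs in memory.items()}

print(diverge("Should cities ban private cars?"))
```

Because every step is an ordinary prompt-completion or retrieval call, the loop needs no access to logits or decoding parameters, which is what makes the framework usable with black-box APIs.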

Because open‑ended tasks lack a single ground‑truth answer, the authors introduce a new evaluation suite tailored to the diversity–quality trade‑off. They define two complementary diversity metrics:

  • Semantic Diversity – measured as the dispersion of response embeddings in a high‑dimensional semantic space, capturing overall topical variety.
  • Viewpoint Diversity – obtained by decomposing a response into atomic “viewpoints” (e.g., environmental, economic, cultural) and computing set‑based overlap (Jaccard or cosine similarity) across them, with lower overlap indicating higher diversity.

Answer quality is assessed using an LLM‑as‑judge paradigm (GPT‑4‑Turbo as the evaluator), which has been shown to correlate strongly with human judgments in prior work. To synthesize these dimensions, the authors propose the Unified Diversity‑Quality Harmonic Score, a harmonic mean of semantic diversity, viewpoint diversity, and quality, providing a single scalar that reflects the desired trade‑off.
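One plausible reading of these metrics is sketched below: semantic diversity as mean pairwise cosine distance between response embeddings, viewpoint diversity as one minus mean pairwise Jaccard overlap of atomic-viewpoint sets, and the unified score as a three-way harmonic mean. The embedding model, normalization, and exact aggregation in the paper may differ; this is an illustration of the metric structure, not the authors' definitions.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def semantic_diversity(embeddings):
    """Dispersion of response embeddings: mean pairwise cosine distance."""
    pairs = list(combinations(embeddings, 2))
    return sum(1 - cosine(u, v) for u, v in pairs) / len(pairs)

def viewpoint_diversity(viewpoint_sets):
    """One minus mean pairwise Jaccard overlap of atomic-viewpoint sets."""
    pairs = list(combinations(viewpoint_sets, 2))
    jaccard = sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)
    return 1 - jaccard

def harmonic_score(sem_div, view_div, quality):
    """Harmonic mean of semantic diversity, viewpoint diversity, and quality."""
    return 3 / (1 / sem_div + 1 / view_div + 1 / quality)
```

The harmonic mean is a natural choice here because it punishes imbalance: a system that maximizes diversity while letting quality collapse (or vice versa) scores low, matching the intended trade-off.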

Experiments are conducted on Infinity‑Chat, a large, real‑world benchmark comprising thousands of open‑ended queries spanning diverse topics and cultural contexts. Baselines include (i) vanilla RAG, (ii) diversity‑enhanced IR techniques (e.g., Maximal Marginal Relevance), and (iii) recent prompt‑based diversity methods such as Diverse‑Prompt and Multi‑View Prompt. Results show that DIVERGE outperforms all baselines on the unified score. Specifically, semantic diversity improves by roughly 2.5×, viewpoint diversity by 1.6×, while the quality metric drops by less than 0.3 %, indicating negligible degradation. Ablation studies confirm that each component contributes: removing reflection‑guided generation reduces diversity by ~30 %; discarding the memory leads to a 45 % increase in viewpoint overlap; eliminating viewpoint‑conditioned retrieval cuts semantic diversity by ~20 %.

The paper also discusses limitations. The memory module, while lightweight, grows linearly with the number of viewpoints and may become a bottleneck for extremely long sessions. Moreover, the quality of generated viewpoints depends heavily on the prompt design; poorly crafted prompts could re‑introduce bias. Future work is outlined to (a) develop automatic viewpoint extraction models, (b) explore more aggressive memory compression, and (c) incorporate human‑annotated viewpoint taxonomies for supervised fine‑tuning.

In summary, DIVERGE demonstrates that diversity can be systematically engineered into RAG pipelines without sacrificing answer quality, even when using closed‑source, black‑box LLMs. By marrying reflection‑guided generation, viewpoint‑conditioned retrieval, and iterative memory‑based refinement, the framework offers a practical path toward fairer, more creative, and more inclusive AI‑assisted information seeking. This contribution is likely to influence downstream applications such as conversational assistants, educational tutoring systems, and policy‑analysis tools, where presenting a spectrum of viewpoints is essential.

