TCDE: Topic-Centric Dual Expansion of Queries and Documents with Large Language Models for Information Retrieval
Query Expansion (QE) enriches queries and Document Expansion (DE) enriches documents, but the two techniques are usually applied separately. Such separate application can lead to semantic misalignment between expanded queries (or documents) and their relevant documents (or queries). To address this issue, we propose TCDE, a dual expansion strategy that leverages large language models (LLMs) for topic-centric enrichment of both queries and documents. TCDE uses two distinct prompt templates, one for queries and one for documents. On the query side, an LLM is guided to identify distinct sub-topics within each query and to generate a focused pseudo-document for each sub-topic. On the document side, an LLM is guided to distill each document into a set of core topic sentences. The resulting outputs expand the original query and document, respectively. This topic-centric dual expansion establishes semantic bridges between queries and their relevant documents, enabling better alignment for downstream retrieval models. Experiments on two challenging benchmarks, TREC Deep Learning and BEIR, show that TCDE achieves substantial improvements over strong state-of-the-art expansion baselines; on dense retrieval, for example, it yields a relative improvement of 2.8% in NDCG@10 on the SciFact dataset. These results validate the effectiveness of our topic-centric dual expansion strategy.
💡 Research Summary
The paper introduces TCDE (Topic‑Centric Dual Expansion), a novel, training‑free framework that simultaneously expands queries and documents using large language models (LLMs) in a topic‑centric manner. Traditional query expansion (QE) and document expansion (DE) techniques are typically applied independently, which can cause semantic misalignment between the enriched queries and the documents they aim to retrieve. TCDE addresses this by converting both queries and documents into a shared abstract representation—“topics”—thereby aligning them at both lexical and semantic levels.
Methodology
TCDE consists of two symmetric components: Topic‑Centric Query Expansion (TQE) and Topic‑Centric Document Expansion (TDE).
- TQE: For a given user query q, an LLM is prompted to identify N distinct sub‑topics and to generate a short pseudo‑document for each sub‑topic. The set of generated pseudo‑documents, D_Topic = {d_t1,…,d_tN}, is concatenated with the original query repeated five times (to preserve the original intent) to form the expanded query q⁺. This design injects diverse, topic‑focused content while keeping the core intent dominant.
- TDE: For each corpus document d, an LLM is prompted to extract N concise topic sentences, each summarizing a core aspect of the document. The set S_Topic = {s₁,…,s_N} is appended to the original document, yielding the expanded document d⁺. By summarizing rather than synthesizing new content, TDE avoids topic drift while providing explicit topic cues that mirror the query‑side expansions.
Both expansions use the same N to maintain symmetry, and the resulting q⁺ and d⁺ are indexed and matched using standard retrieval pipelines. The authors formalize the alignment effect: after expansion, the similarity score S(q⁺, d⁺_pos) for a relevant pair increases, while S(q⁺, d⁺_neg) for an irrelevant pair decreases, compared with the original scores. This is demonstrated both analytically (Equation 5) and empirically via t‑SNE visualizations.
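The alignment effect described above can be written as a pair of inequalities; this is a paraphrase of the claim, and the exact form of the paper's Equation 5 may differ:

```latex
S(q^{+}, d^{+}_{\text{pos}}) > S(q, d_{\text{pos}}),
\qquad
S(q^{+}, d^{+}_{\text{neg}}) < S(q, d_{\text{neg}})
```

That is, dual expansion pushes relevant pairs closer together and irrelevant pairs further apart under the retrieval similarity S.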
Experimental Setup
The authors evaluate TCDE on a broad suite of benchmarks: three web‑search datasets (MS MARCO, TREC DL 2019, TREC DL 2020) and the zero‑shot BEIR benchmark covering domains such as scientific literature, biomedical texts, and general knowledge (e.g., SciFact, FEVER, NaturalQuestions). Both sparse retrieval (BM25) and dense retrieval (DPR, ColBERT) are tested. Baselines include classic PRF methods, recent LLM‑based QE approaches (HyDE, Query2Doc, GRF), and state‑of‑the‑art DE methods (Doc2Query, docT5query, Ma et al.’s LLM‑based DE).
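Since TCDE leaves the retrieval pipeline itself unchanged, the sparse setting reduces to indexing expanded documents and scoring expanded queries. A minimal sketch, with a simple term-overlap scorer standing in for BM25 and a hypothetical two-document corpus:

```python
# Minimal sparse-retrieval sketch. The corpus, query, and overlap scorer
# are illustrative stand-ins; real experiments use BM25 over full collections.
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    return re.findall(r"\w+", text.lower())

def overlap_score(query: str, doc: str) -> int:
    """Count occurrences of query terms in the document (BM25 stand-in)."""
    doc_counts = Counter(tokenize(doc))
    return sum(doc_counts[t] for t in set(tokenize(query)))

# Hypothetical expanded documents (d+): topic sentences appended.
corpus = [
    "Solar panels convert sunlight. Topic: photovoltaic energy conversion.",
    "Wind turbines generate power. Topic: wind energy generation.",
]
# Hypothetical expanded query (q+): original terms plus pseudo-document terms.
query = "solar energy photovoltaic panels convert sunlight"

ranked = sorted(range(len(corpus)),
                key=lambda i: overlap_score(query, corpus[i]),
                reverse=True)
print(ranked)  # → [0, 1]: the solar document ranks first
```

The appended topic sentences raise term overlap for the relevant pair, which is the lexical half of the alignment effect; dense retrievers benefit analogously through embedding proximity.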
Results
Across all datasets, TCDE consistently outperforms the baselines. Notably, on the SciFact dataset TCDE achieves a 2.8% relative gain in NDCG@10 over the best prior method, highlighting its effectiveness for scientific queries. Improvements are observed in both sparse and dense settings, confirming that the topic-centric expansions benefit lexical matching (more keyword overlap) and semantic matching (better embedding proximity). Ablation studies reveal that (1) the number of topics N (typically 5) balances coverage and noise, (2) repeating the original query five times is crucial for preserving intent, and (3) using concise topic sentences for documents mitigates hallucination while still providing alignment cues.
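For reference, NDCG@10, the metric reported above, can be sketched as follows. This uses the linear-gain formulation (some evaluations use 2^rel − 1 gains instead), with hypothetical graded relevance labels:

```python
# NDCG@k sketch: DCG of the system ranking divided by DCG of the ideal
# ranking. Relevance labels here are hypothetical.
import math

def ndcg_at_k(relevances: list[int], k: int = 10) -> float:
    """relevances[i] is the graded label of the i-th ranked document."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

print(round(ndcg_at_k([3, 2, 0, 1]), 3))  # → 0.985
```

Because the metric is normalized per query, the reported gains reflect better ordering of relevant documents rather than raw score inflation.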
Analysis and Limitations
TCDE’s main strength lies in its explicit alignment mechanism: by forcing both sides to speak the same “topic language,” it reduces the semantic gap that plagues asymmetric expansion. The framework is training‑free, requiring only prompt engineering and LLM inference, which simplifies deployment. However, the approach inherits the high computational cost of LLM calls, and performance can be sensitive to prompt wording and the chosen LLM (e.g., GPT‑3.5 vs. Claude). The selection of N is dataset‑dependent; an adaptive or learned N could further improve robustness. Moreover, the current experiments focus on English; extending to multilingual settings would require careful prompt translation and evaluation.
Conclusion
TCDE presents a compelling solution to the longstanding vocabulary mismatch problem by jointly expanding queries and documents in a topic‑centric fashion. Its dual expansion strategy yields measurable gains across diverse domains and retrieval paradigms, demonstrating that semantic alignment can be engineered directly through LLM‑driven topic generation and summarization. Future work may explore cost‑effective LLM alternatives, dynamic topic number selection, and multilingual adaptations, paving the way for broader adoption of dual expansion in real‑world search systems.