Enhancing Retrieval-Augmented Generation with Topic-Enriched Embeddings: A Hybrid Approach Integrating Traditional NLP Techniques
Retrieval-Augmented Generation (RAG) systems rely on document retrieval to ground large language model outputs, yet retrieval quality often degrades in corpora where topics overlap and relevant evidence is distributed across long, heterogeneous texts. This paper proposes topic-enriched embeddings, a hybrid representation that integrates term-frequency signals (TF-IDF), dimensionality-reduced semantic structure (LSA), and probabilistic topic mixtures (LDA) into a unified vector space anchored by contextual sentence embeddings (all-MiniLM-L6-v2). The approach injects corpus-level thematic information into dense representations through two fusion strategies, concatenation and weighted averaging, while preserving computational tractability via latent-space compression. Empirical evaluation on a legal corpus of 12,436 documents related to Argentina’s Law 19.640 shows that topic enrichment improves both clustering coherence and retrieval effectiveness relative to statistical, probabilistic, and contextual-only baselines, with consistent gains in Precision@k, Recall@k, and F1. The results suggest that explicitly incorporating latent topic structure into embedding construction can reduce redundant or off-topic chunk retrieval, strengthening the evidential grounding of RAG pipelines in knowledge-intensive settings.
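The sketch below illustrates the two fusion strategies the abstract names, using standard scikit-learn and sentence-transformers components. The component counts, the per-block L2 normalization, the padding scheme for weighted averaging, and the fusion weights are illustrative assumptions, not hyperparameters reported by the paper.

```python
# Minimal sketch of topic-enriched embedding fusion (assumptions noted inline).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation
from sklearn.preprocessing import normalize
from sentence_transformers import SentenceTransformer

docs = [
    "Law 19.640 establishes a special customs and promotional regime.",
    "The regime covers industrial activity in Tierra del Fuego.",
    "Retrieval systems ground language model outputs in source documents.",
]

# Statistical signal: TF-IDF, compressed with LSA (truncated SVD).
tfidf = TfidfVectorizer().fit_transform(docs)
lsa_vecs = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

# Probabilistic signal: LDA topic mixtures over raw term counts.
counts = CountVectorizer().fit_transform(docs)
lda_vecs = LatentDirichletAllocation(
    n_components=2, random_state=0
).fit_transform(counts)

# Contextual anchor: dense sentence embeddings (384-d for all-MiniLM-L6-v2).
dense = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)

# Fusion strategy 1: concatenation. Each block is L2-normalized first so no
# single signal dominates by scale (an assumed choice, not specified above).
concat = np.hstack([normalize(dense), normalize(lsa_vecs), normalize(lda_vecs)])

# Fusion strategy 2: weighted averaging. The topic-derived blocks are
# zero-padded to the dense dimensionality so vectors can be averaged
# elementwise (again an assumption; the paper's scheme may differ).
def pad(v, dim):
    return np.pad(v, ((0, 0), (0, dim - v.shape[1])))

d = dense.shape[1]
w_dense, w_lsa, w_lda = 0.6, 0.2, 0.2  # illustrative weights
avg = (w_dense * normalize(dense)
       + w_lsa * pad(normalize(lsa_vecs), d)
       + w_lda * pad(normalize(lda_vecs), d))

print(concat.shape, avg.shape)  # (3, 388) (3, 384)
```

Either fused vector can then be indexed in place of the plain dense embedding; concatenation preserves each signal in its own subspace at the cost of a larger index, while weighted averaging keeps the original dimensionality, consistent with the tractability concern the abstract raises.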