CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning
Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge but still suffers from long contexts and disjoint retrieval-generation optimization. In this work, we propose CLaRa (Continuous Latent Reasoning), a unified framework that performs embedding-based compression and joint optimization in a shared continuous space. To obtain semantically rich and retrievable compressed vectors, thereby reducing the document length fed into the generator, we introduce SCP, a key-preserving data synthesis framework based on question answering and paraphrase supervision. CLaRa then trains the reranker and generator end-to-end via a single language modeling loss, with gradients flowing through both modules using a differentiable top-k estimator. Theoretically, this unified optimization aligns retrieval relevance with answer quality. Experiments across multiple QA benchmarks show that CLaRa achieves state-of-the-art compression and reranking performance, even at a text compression rate of 16, outperforming text-based fine-tuned baselines.
💡 Research Summary
Retrieval‑augmented generation (RAG) has become a dominant paradigm for equipping large language models (LLMs) with external knowledge, yet two fundamental problems persist: (1) inefficiency, caused by a mismatch between dense retrievers that operate on embeddings and generators that still consume raw text, leading to duplicated encoding, context overflow, and high inference cost; and (2) an optimization disconnect: the retrieval step is discrete, so gradients from the generator’s loss cannot reach the retriever. CLaRa (Continuous Latent Reasoning) addresses both issues by moving the entire pipeline into a shared continuous latent space.
The first component, SCP (Salient Compressor Pretraining), creates a high‑quality compression model that learns to retain only the most informative parts of a document. SCP synthesizes supervision by generating, for 2 M Wikipedia articles, (i) simple QA pairs that each target a single fact, (ii) complex QA pairs that combine multiple facts to encourage relational reasoning, and (iii) paraphrases that preserve meaning while altering surface form. An LLM (Qwen‑32B) produces these signals, and a verification loop checks factual consistency and coverage, iteratively adding missing QA pairs until a coverage threshold is met. The compressor is then trained with a cross‑entropy loss on these tasks together with an MSE alignment loss that forces the hidden states of the learned memory tokens to stay close to the averaged hidden states of the original tokens, ensuring semantic fidelity.
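The alignment objective above can be illustrated with a small numpy sketch. This is a simplified reading, not the paper's exact formulation: the mean-pooling of token hidden states and the shapes used here are assumptions for illustration.

```python
import numpy as np

def mse_alignment_loss(token_hidden, memory_hidden):
    """MSE between memory-token hidden states and the mean-pooled
    hidden states of the original document tokens (illustrative).

    token_hidden:  (T, d) hidden states of the original tokens
    memory_hidden: (m, d) hidden states of the learned memory tokens
    """
    target = token_hidden.mean(axis=0)          # (d,) pooled document representation
    return np.mean((memory_hidden - target) ** 2)

# Toy check: if every memory token already matches the pooled target,
# the alignment loss vanishes; random memory tokens incur a positive loss.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(128, 16))             # 128 tokens, hidden size 16
memory = np.tile(tokens.mean(axis=0), (4, 1))   # 4 memory tokens placed at the target
loss = mse_alignment_loss(tokens, memory)
```

In training, this term would be added (with some weight) to the cross-entropy loss on the QA and paraphrase tasks, pulling the compressed representation toward the semantics of the full document.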
After SCP, each document is encoded once into a fixed set of learnable memory tokens (the “compressed document”). This representation is dramatically shorter than the original text, enabling offline indexing and eliminating redundant encoding at inference time.
CLaRa’s second stage jointly trains a query reasoner (θ_qr) and an answer generator (θ_g) using only a next‑token prediction loss. The query reasoner shares the same architecture and number of memory tokens as the compressor, so queries are embedded into the same latent space as the compressed documents. Cosine similarity between a query embedding q and each compressed document embedding M_i yields relevance scores s_i. To allow gradients to flow through the top‑k selection, CLaRa employs a Straight‑Through (ST) estimator: a softmax‑based soft selection Z_soft is computed, a hard top‑k selection Z_hard is taken for the forward pass, and the final selector Z = Z_hard + (Z_soft − SG(Z_soft)) combines them, preserving discrete behavior in the forward pass while providing differentiable gradients through Z_soft. The top‑k document embeddings are concatenated with the query embedding and fed to the generator, which predicts the answer token‑by‑token. The language modeling loss L_CLaRa = −∑_t log p_{θ_g}(a_t | q, M_{1:k}, a_{<t}) is back‑propagated through both θ_qr and θ_g, effectively giving the retriever a weak supervision signal derived from answer quality without any explicit relevance labels.
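The scoring-and-selection step can be sketched in numpy. This is a forward-pass illustration only: the stop-gradient SG(·) is modeled as a plain copy (in an autograd framework such as PyTorch it would be `detach()`, letting gradients flow through Z_soft while the forward value equals Z_hard); the documents, scores, and temperature are made up for the example.

```python
import numpy as np

def softmax(x, tau=1.0):
    z = np.exp((x - x.max()) / tau)             # shift for numerical stability
    return z / z.sum()

def st_topk_selector(scores, k, tau=1.0):
    """Straight-through top-k selector (value-level numpy sketch).

    Z_soft: differentiable softmax weights over documents.
    Z_hard: one-hot indicator of the top-k documents.
    Z = Z_hard + (Z_soft - SG(Z_soft)): with SG modeled as a copy,
    the forward value is exactly Z_hard; in autograd, the gradient
    would flow through the Z_soft term.
    """
    z_soft = softmax(scores, tau)
    z_hard = np.zeros_like(scores)
    z_hard[np.argsort(scores)[-k:]] = 1.0       # indicator of the k best scores
    sg_z_soft = z_soft.copy()                   # stand-in for stop-gradient
    return z_hard + (z_soft - sg_z_soft)        # equals z_hard in value

# Cosine-similarity relevance scores between a query and 5 toy documents
q = np.array([1.0, 0.0])
docs = np.array([[0.9, 0.1], [0.1, 0.9], [0.7, 0.3], [-1.0, 0.0], [0.5, 0.5]])
scores = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))
z = st_topk_selector(scores, k=2)
```

The selected documents' embeddings would then be concatenated with the query embedding and passed to the generator, so the only training signal the retriever receives is the generator's language modeling loss.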
The authors provide a theoretical justification showing that the next‑token loss yields valid gradients for the retriever and that the ST estimator reduces gradient variance compared with reinforcement‑learning based approaches such as DDR‑RAG.
Empirically, CLaRa is evaluated on four QA benchmarks (NaturalQuestions, HotpotQA, TriviaQA, and WebQuestions) using Mistral‑7B and Phi‑4B as backbones. Even with a compression ratio of 16 (i.e., each document is reduced to 1/16 of its original token length), CLaRa outperforms strong baselines: standard text‑based RAG, dense retrieval‑only models, and recent compression‑only methods. Ablation studies confirm that SCP’s QA‑paraphrase supervision is crucial for preserving semantic content, and that the ST temperature and the number of memory tokens govern the trade‑off between retrieval recall and generation accuracy.
Limitations include the need for large‑scale LLMs to generate the synthetic supervision for SCP and the offline nature of the compression step, which may be costly for rapidly changing corpora. Future work is suggested on multimodal documents, online compression updates, and lighter‑weight compressors.
In summary, CLaRa demonstrates that moving retrieval and generation into a unified continuous latent space, coupled with differentiable top‑k selection and a salient‑information‑preserving compressor, can substantially improve both efficiency and end‑to‑end optimization of retrieval‑augmented generation systems. This work points toward a new paradigm where LLMs interact with external knowledge through compact, trainable embeddings rather than raw text.