Decoupled Reasoning with Implicit Fact Tokens (DRIFT): A Dual-Model Framework for Efficient Long-Context Inference

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the [Original Paper Viewer] below or the original arXiv source.

The integration of extensive, dynamic knowledge into Large Language Models (LLMs) remains a significant challenge due to the inherent entanglement of factual data and reasoning patterns. Existing solutions, ranging from non-parametric Retrieval-Augmented Generation (RAG) to parametric knowledge editing, are often constrained in practice by finite context windows, retriever noise, or the risk of catastrophic forgetting. In this paper, we propose DRIFT, a novel dual-model architecture designed to explicitly decouple knowledge extraction from the reasoning process. Unlike static prompt compression, DRIFT employs a lightweight knowledge model to dynamically compress document chunks into implicit fact tokens conditioned on the query. These dense representations are projected into the reasoning model’s embedding space, replacing raw, redundant text while maintaining inference accuracy. Extensive experiments show that DRIFT significantly improves performance on long-context tasks, outperforming strong baselines among comparably sized models. Our approach provides a scalable and efficient paradigm for extending the effective context window and reasoning capabilities of LLMs. Our code is available at https://github.com/Lancelot-Xie/DRIFT.


💡 Research Summary

The paper introduces DRIFT, a dual‑model framework that explicitly separates knowledge extraction from reasoning to enable efficient long‑context inference in large language models (LLMs). The system consists of a lightweight “knowledge” model (ψₖₙₒ) and a larger “reasoning” model (θᵣₑₐ). Given a query Q and a long document X, the document is first split into semantically coherent chunks using a recursive splitter. Each chunk Cⱼ is processed in parallel by ψₖₙₒ together with a fixed number of special <|CPS|> tokens; the model’s last‑layer hidden states for these tokens become a set of dense vectors Tⱼ, called implicit fact tokens. These tokens are concatenated across all chunks and passed through a three‑layer MLP projector π, which aligns them with the embedding space of the reasoning model, producing implicit fact embeddings E. The reasoning model receives E together with the query embedding E(Q) and generates the final answer, thereby avoiding the need to ingest the raw, often redundant, text.
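The pipeline above can be sketched in a few lines. This is an illustrative mock, not the paper's implementation: the dimensions, the fixed-size chunker standing in for the recursive splitter, and the random "knowledge model" are all assumptions made purely to show how the pieces (chunking, per-chunk fact tokens Tⱼ, the three-layer projector π) fit together.

```python
import numpy as np

# Hypothetical dimensions -- the paper's exact sizes are not stated in the summary.
D_KNO, D_REA, N_CPS = 64, 128, 4  # knowledge-model dim, reasoning-model dim, <|CPS|> tokens per chunk

rng = np.random.default_rng(0)

def split_into_chunks(doc_tokens, max_len=8):
    """Stand-in for the recursive splitter: fixed-size chunks, for illustration only."""
    return [doc_tokens[i:i + max_len] for i in range(0, len(doc_tokens), max_len)]

def knowledge_model(chunk, n_cps=N_CPS):
    """Mock of psi_kno: returns last-layer hidden states for the <|CPS|> tokens (T_j)."""
    return rng.standard_normal((n_cps, D_KNO))

class Projector:
    """Three-layer MLP pi mapping implicit fact tokens into the reasoning model's space."""
    def __init__(self):
        self.w1 = rng.standard_normal((D_KNO, D_REA)) * 0.01
        self.w2 = rng.standard_normal((D_REA, D_REA)) * 0.01
        self.w3 = rng.standard_normal((D_REA, D_REA)) * 0.01

    def __call__(self, t):
        h = np.maximum(t @ self.w1, 0)  # ReLU between layers
        h = np.maximum(h @ self.w2, 0)
        return h @ self.w3

doc = list(range(20))                                  # toy "document" of 20 token ids
chunks = split_into_chunks(doc)                        # 3 chunks of sizes 8, 8, 4
fact_tokens = np.concatenate([knowledge_model(c) for c in chunks])  # T = [T_1; ...; T_k]
E = Projector()(fact_tokens)                           # implicit fact embeddings for theta_rea
print(E.shape)                                         # -> (12, 128): n_chunks * N_CPS rows
```

The reasoning model would then consume `E` prepended to the query embedding E(Q) instead of the raw document tokens; that step is omitted here.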

A key technical contribution is bucketed compression. Instead of a fixed compression ratio (e.g., 8:1 for every input), the authors define token‑length buckets (e.g., 64‑128, 128‑256 tokens). For an input of length n, the output token count is set to the upper bound of the bucket containing n, divided by a base compression factor. This dynamic allocation reflects the reality that informative content is unevenly distributed: a few critical sentences may carry most of the answer‑relevant information, while large portions are filler. Bucketed compression thus preserves more tokens for dense information regions and reduces waste in sparse ones, improving robustness especially under extreme compression ratios.
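The bucketing rule is simple enough to state as code. The bucket boundaries and base ratio below are illustrative values consistent with the examples in the summary; the paper's exact configuration may differ.

```python
def bucketed_output_tokens(n, buckets=((64, 128), (128, 256), (256, 512)), base_ratio=8):
    """Output token budget for an input of n tokens: the upper bound of the
    bucket containing n, divided by the base compression factor.

    `buckets` and `base_ratio` are illustrative, not the paper's exact settings.
    """
    for lo, hi in buckets:
        if n <= hi:
            return hi // base_ratio
    return buckets[-1][1] // base_ratio  # inputs beyond the last bucket use its budget

print(bucketed_output_tokens(100))   # falls in the 64-128 bucket  -> 128 // 8 = 16
print(bucketed_output_tokens(200))   # falls in the 128-256 bucket -> 256 // 8 = 32
```

Note that two inputs of 130 and 250 tokens receive the same 32-token budget, so the shorter one is compressed less aggressively; this is the mechanism that spends more capacity where content density may be higher.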

Training proceeds in three stages:

  1. Latent Fact Reconstruction Pre‑training (LFRP) – ψₖₙₒ is trained to compress documents with a static ratio (cₛₜₐ = 8) into latent fact tokens. The reasoning model is frozen and used as a decoder to reconstruct the original document from the projected embeddings. Gradients flow only through ψₖₙₒ and π, teaching the knowledge model to produce representations that are maximally useful for reconstruction.

  2. Query‑Aware Fine‑Tuning – Dynamic Compression (QAFT‑DC) – A query‑conditioned compression instruction is added, allowing ψₖₙₒ to selectively encode information relevant to Q. The loss combines reconstruction error and a query‑answer supervision term, encouraging dynamic allocation of tokens per bucket while still preserving enough detail for downstream tasks.

  3. Query‑Aware Fine‑Tuning – Answer Generation (QAFT‑QA) – Finally, θᵣₑₐ is fine‑tuned to generate answers directly from the implicit fact embeddings, completing the end‑to‑end pipeline.
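The three stages can be summarized schematically: which modules receive gradients, and which loss terms are active, at each step. The trainable/frozen split follows the description above; the stage dictionary itself is just an organizational sketch, not the authors' code.

```python
# Schematic of the three-stage curriculum. Stage names follow the summary;
# the loss-term lists paraphrase the text rather than the paper's exact objectives.
STAGES = {
    "LFRP":    {"trainable": {"psi_kno", "projector"},  # reasoning model frozen as decoder
                "losses": ["reconstruction"]},
    "QAFT-DC": {"trainable": {"psi_kno", "projector"},  # query-conditioned compression
                "losses": ["reconstruction", "query_answer"]},
    "QAFT-QA": {"trainable": {"theta_rea"},             # fine-tune the reasoning model
                "losses": ["answer_generation"]},
}

def requires_grad(module, stage):
    """True if `module` is updated during `stage`; all other modules stay frozen."""
    return module in STAGES[stage]["trainable"]

print(requires_grad("theta_rea", "LFRP"))   # -> False: stage 1 only trains psi_kno and pi
print(requires_grad("psi_kno", "QAFT-DC"))  # -> True
```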

The authors built a large‑scale Document‑QA‑Evidence dataset (≈300 K instances, documents 1K–8K tokens) with fine‑grained evidence annotations to train and evaluate the system. Experiments on LongBench v2, multi‑document QA, and fact‑verification benchmarks demonstrate that DRIFT dramatically improves both accuracy and efficiency. Using a Mistral‑7B reasoning backbone, DRIFT achieves a 32× compression while raising the LongBench score from 20.87 % to 29.22 % (an 8.35 % absolute gain). Even at more aggressive 64× and 128× compression, performance remains competitive. In terms of speed, processing a 256 k‑token document yields an average 7× inference speedup and a 30 % reduction in GPU memory consumption compared to feeding the raw text into the same reasoning model.

Ablation studies confirm the importance of bucketed compression (fixed‑ratio baselines lose up to 4 % accuracy), the three‑layer projector (deeper projectors improve alignment), and the size of ψₖₙₒ (tiny models still work but larger ones give smoother token allocation). Comparisons with prior soft‑compression methods such as COCOM, xRAG, and E2LLM show that DRIFT’s query‑conditioned compression avoids the information loss typical of static latent representations and eliminates the dependence on a separate retriever.

Limitations are acknowledged: the current chunking relies on a deterministic text splitter, which may be suboptimal for highly unstructured data (code, tables). The interface between the two models is fixed; extending DRIFT to multimodal inputs or to other reasoning model architectures would require retraining the projector. Extremely high compression (>256×) still leads to noticeable degradation, suggesting future work on adaptive token budgeting or hierarchical compression.

In summary, DRIFT offers a practical paradigm for extending the effective context window of LLMs without sacrificing reasoning quality. By decoupling knowledge extraction (dynamic, query‑aware compression) from reasoning (large‑scale inference on compact embeddings), it delivers both higher accuracy on knowledge‑intensive long‑context tasks and substantial computational savings, paving the way for more scalable and responsive LLM deployments.

