AttentionRetriever: Attention Layers are Secretly Long Document Retrievers

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv paper.

Retrieval-augmented generation (RAG) has been widely adopted to help large language models (LLMs) process tasks involving long documents. However, existing retrieval models are not designed for long-document retrieval and fail to address several of its key challenges, including context awareness, causal dependence, and the scope of retrieval. In this paper, we propose AttentionRetriever, a novel long-document retrieval model that leverages the attention mechanism and entity-based retrieval to build context-aware embeddings for long documents and to determine the scope of retrieval. In extensive experiments, AttentionRetriever outperforms existing retrieval models on long-document retrieval datasets by a large margin while remaining as efficient as dense retrieval models.


💡 Research Summary

Paper Overview
The paper introduces AttentionRetriever, a novel long‑document retrieval model that exploits the attention mechanism of pretrained large language models (LLMs) to score relevance without any additional training. By treating cross‑attention scores between a query and document tokens as similarity measures, the method obtains sentence‑level relevance estimates directly from selected transformer layers. To overcome the “query‑dependency” problem—where background sentences that are not directly similar to the query are missed—the authors augment the attention scores with an entity‑based retrieval step. They construct a lightweight entity graph linking each entity to the sentences that mention it, rank entities according to their association with the query, and then include all sentences containing top‑ranked entities in the final retrieved set.
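The core idea of reading relevance off attention maps can be illustrated with a toy sketch. This is not the paper's implementation: the attention matrix here is random stand-in data, and the sentence spans and pooling choices (mean over query tokens, sum over sentence tokens) are assumptions for illustration only.

```python
import numpy as np

# Toy illustration: treat the cross-attention weights that query tokens
# place on document tokens as per-token relevance, then pool them into
# sentence-level scores.

rng = np.random.default_rng(0)

# Hypothetical setup: 3 document sentences flattened into 10 tokens.
sentence_spans = [(0, 4), (4, 7), (7, 10)]  # (start, end) token ranges
num_query_tokens = 5

# Stand-in for attention extracted from one selected transformer layer:
# shape (query_tokens, document_tokens); rows sum to 1 like softmax output.
attn = rng.random((num_query_tokens, 10))
attn /= attn.sum(axis=1, keepdims=True)

# Pool: average over query tokens, then sum token scores per sentence.
token_scores = attn.mean(axis=0)
sentence_scores = [token_scores[s:e].sum() for s, e in sentence_spans]

# Rank sentences by their pooled attention mass.
ranked = sorted(range(len(sentence_spans)),
                key=lambda i: sentence_scores[i], reverse=True)
print("sentence ranking (best first):", ranked)
```

In a real model the matrix would come from the attention outputs of the chosen layers (aggregated across heads, as described below) rather than random data.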

Key Technical Contributions

  1. Training‑Free Use of Attention – Empirical analysis on LLaMA‑3.2‑3B, Qwen‑2.5‑3B, and Mistral‑7B shows that certain middle‑to‑late layers achieve high retrieval accuracy, while earlier layers focus on independent sub‑queries. The authors therefore select the most informative layers and aggregate their cross‑attention scores across heads to produce a robust relevance signal.

  2. Entity‑Graph Scope Determination – Recognizing that pure attention may ignore necessary background information, the paper builds an entity graph (entities as nodes, sentences as edges) and ranks entities by the summed scores of their containing sentences. This step expands the retrieval scope to include contextually important but low‑similarity passages.

  3. Hybrid Scoring – Attention‑derived scores are combined with a conventional dense embedding similarity (e.g., ANCE) to further improve precision, especially for queries where lexical overlap is limited.

  4. Scalable Long‑Context Handling – The method integrates the Cascading KV‑Cache approximation, allowing attention to be computed over documents of ~100 k tokens with modest memory and time overhead. Experiments demonstrate that attention remains effective in the middle of long texts, mitigating the classic “lost‑in‑the‑middle” issue.

  5. New Benchmark – The authors release a LongDoc dataset containing documents averaging over 100 k words, together with diverse query types. This dataset is the first to exceed the context window of most existing LLMs, providing a realistic testbed for long‑document retrieval.
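The entity-graph scope expansion (contribution 2 above) can be sketched as follows. The sentences, entity mentions, and scores here are invented toy data, and the NER step is assumed to have run upstream; the scoring rule (sum of containing-sentence scores) follows the description above.

```python
from collections import defaultdict

# Hedged sketch of entity-based scope expansion, on toy data.
sentences = [
    "Marie Curie discovered polonium.",
    "Polonium is highly radioactive.",
    "The weather in Paris was mild.",
]
# Assume an upstream NER step produced these entity mentions per sentence.
entities_per_sentence = [["Marie Curie", "polonium"], ["polonium"], ["Paris"]]

# Attention-derived sentence scores (illustrative values).
sentence_scores = [0.7, 0.1, 0.2]

# Build the entity graph: entity -> set of sentences that mention it.
entity_to_sents = defaultdict(set)
for idx, ents in enumerate(entities_per_sentence):
    for ent in ents:
        entity_to_sents[ent].add(idx)

# Rank entities by the summed scores of their containing sentences.
entity_score = {ent: sum(sentence_scores[i] for i in sents)
                for ent, sents in entity_to_sents.items()}
top_entities = sorted(entity_score, key=entity_score.get, reverse=True)[:1]

# Expand the retrieval scope: include every sentence that mentions
# a top-ranked entity, even if its own attention score is low.
scope = set().union(*(entity_to_sents[e] for e in top_entities))
print("expanded scope:", sorted(scope))
```

Note how sentence 1 enters the scope despite its low attention score, because it shares the top-ranked entity ("polonium") with the high-scoring sentence 0; this is the query-dependency fix described in the overview.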
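The hybrid scoring step (contribution 3 above) can be illustrated as a simple interpolation between attention-derived relevance and dense-embedding similarity. The interpolation weight `alpha` and the toy vectors are assumptions for illustration; the paper's exact combination rule may differ.

```python
import numpy as np

# Minimal sketch: mix attention-derived relevance with dense cosine similarity.

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

attention_scores = np.array([0.6, 0.3, 0.1])  # from attention layers (toy)

# Toy 2-d embeddings standing in for a dense retriever's vectors.
query_emb = np.array([1.0, 0.0])
sentence_embs = np.array([[0.9, 0.1], [0.0, 1.0], [0.7, 0.7]])
dense_scores = np.array([cosine(query_emb, e) for e in sentence_embs])

alpha = 0.5  # interpolation weight (a tunable assumption)
hybrid = alpha * attention_scores + (1 - alpha) * dense_scores
best = int(hybrid.argmax())
print("best sentence:", best)
```

A weighted sum like this lets the dense score rescue queries with little lexical or attention signal, while the attention score supplies the context awareness dense models lack.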

Experimental Findings

  • On the new LongDoc benchmark and six additional single‑document retrieval datasets, AttentionRetriever outperforms state‑of‑the‑art sparse (BM25) and dense (DPR, ANCE, GTR, mGTE, Grit‑LM) models by 10–15 percentage points in MAP@10 and Recall@100.
  • The approach remains as efficient as dense retrieval: using a 3 B‑parameter LLM, inference speed is 2–3× faster than comparable 7–10 B dense models while delivering comparable or superior accuracy.
  • In multi‑document retrieval settings, where contextual linking is less critical, AttentionRetriever still matches or exceeds baseline performance, confirming its versatility.

Limitations and Future Directions

  • The optimal set of transformer layers is model‑specific; an automated layer‑selection mechanism would improve generality.
  • Entity graph construction relies on the quality of named‑entity recognition; errors can propagate to the final retrieval set.
  • Current design handles single query‑document pairs; extending to complex multi‑condition queries or logical operators will require additional modeling.
  • Further research could explore learned weighting of attention layers, richer entity‑relation graphs, or integration with other attention‑approximation techniques for real‑time deployment.

Conclusion
AttentionRetriever demonstrates that the attention maps of pretrained LLMs, when combined with a simple entity‑graph scope estimator, can serve as an effective, training‑free long‑document retriever. It addresses contextual, causal, and query dependencies that existing retrieval models overlook, achieves strong empirical gains on a newly introduced ultra‑long document benchmark, and does so with modest computational resources. The work opens a promising avenue for leveraging the latent retrieval capabilities already present in large language models.

