Sentinel: Decoding Context Utilization via Attention Probing for Efficient LLM Context Compression
Retrieval-augmented generation (RAG) often suffers from long and noisy retrieved contexts. Prior context compression methods rely on predefined importance metrics or supervised compression models, rather than on the model’s own inference-time behavior. We propose Sentinel, a lightweight sentence-level compression framework that treats context compression as an understanding-decoding problem. Sentinel probes the native attention behaviors of a frozen LLM with a lightweight readout to decode which parts of the context are actually utilized when answering a query, rather than using attention as a direct relevance score. We empirically observe that decoded relevance signals are sufficiently consistent across model scales to support effective compression with compact proxy models. On LongBench, Sentinel with a 0.5B proxy model achieves up to 5× compression while matching the QA performance of 7B-scale baselines, and, despite being trained only on English QA data, generalizes effectively to Chinese and out-of-domain settings.
💡 Research Summary
The paper introduces Sentinel, a novel framework for compressing retrieved contexts in Retrieval‑Augmented Generation (RAG) systems by directly probing the internal attention dynamics of a frozen large language model (LLM). Traditional context‑compression methods fall into two categories: metric‑based approaches that rely on handcrafted importance scores (e.g., perplexity, mutual information, query‑context similarity) and data‑driven approaches that train separate compression models using external supervision such as answer spans or generation feedback. Both have drawbacks—metric‑based methods are lightweight but only loosely correlated with the model’s actual inference behavior, while data‑driven methods achieve strong performance at the cost of additional training and often tie the compression policy to a specific downstream objective.
Sentinel reframes the problem as “understanding decoding”: given a query q and a set of sentences C = {s₁,…,sₙ}, the goal is to select a subset C′ ⊆ C that contains exactly the sentences the LLM actually uses to answer q. To achieve this, the authors propose a lightweight probing pipeline that requires no fine‑tuning of the target LLM and no full autoregressive generation of the answer.
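The selection objective above can be sketched as a budgeted top-k over per-sentence relevance scores. The sketch below is illustrative only: the function name, the whitespace-based token count, and the greedy budget handling are assumptions, not the paper's implementation.

```python
def select_context(sentences, scores, token_budget=2000):
    """Keep the highest-scoring sentences that fit the token budget,
    then restore the original document order (illustrative sketch)."""
    ranked = sorted(range(len(sentences)), key=lambda i: -scores[i])
    kept, used = [], 0
    for i in ranked:
        cost = len(sentences[i].split())  # crude stand-in for a real tokenizer
        if used + cost <= token_budget:
            kept.append(i)
            used += cost
    return [sentences[i] for i in sorted(kept)]
```

Restoring the original order matters: downstream generation reads the compressed context as ordinary prose, so selected sentences must not be reshuffled by rank.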
Key components of Sentinel:
- Proxy Model & Prompt – A compact decoder‑only model (default Qwen‑2.5‑0.5B‑Instruct) is kept frozen. The query and the entire retrieved context are fed together with a QA‑style instruction that encourages the model to “summarize” the relevant information at the final token.
- Final‑Token Attention Extraction – During a single forward pass, the attention tensor at the final prompt token is captured. This tensor spans all layers (L), heads (H), and input tokens (T). Prior work has shown that the final token’s attention often aggregates a compressed representation of the whole input (an “over‑squashing” effect).
- Sentence‑Level Feature Aggregation – For each sentence sᵢ, attention weights directed to its constituent tokens are summed and normalized by the total attention mass over the whole context. The resulting vector vᵢ ∈ ℝ^{L·H} encodes how each layer‑head contributes to the model’s focus on that sentence.
- Lightweight Probing Classifier – A linear logistic‑regression probe maps vᵢ to a scalar relevance score ŷᵢ = σ(wᵀvᵢ + b). The probe is deliberately simple so that it decodes behaviors already present in the attention patterns rather than learning new ones, and its weights are directly interpretable (e.g., identifying which heads are most predictive of context utilization).
- Weak Supervision from QA Data – Training data are constructed automatically from existing QA datasets: sentences containing the gold answer span are labeled positive; all others are negative. To ensure the supervision reflects genuine reliance on the retrieved context, the authors filter for “context‑reliant” examples, i.e., instances where the model fails without the context but succeeds when it is provided. Sentence shuffling during training mitigates positional bias.
- Inference‑Time Compression – At test time, the same proxy model and prompt are used to extract final‑token attention, sentence‑level features are computed, and the trained probe assigns relevance scores. Sentences are then ranked and a top‑k subset is selected under a token budget (e.g., 2000 tokens) or a compression ratio (e.g., 20%). The selected sentences are concatenated in original order and passed to the downstream LLM for answer generation.
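The aggregation and probing steps can be sketched together in a few lines of NumPy. Shapes follow the description above; the exact normalization and all variable names are assumptions rather than the paper's code.

```python
import numpy as np

def sentence_features(attn, spans):
    """attn: (L, H, T) attention from the final token to all T input tokens.
    spans: [(start, end)] token ranges, one per sentence.
    Returns an (n_sentences, L*H) matrix: per layer-head attention mass on
    each sentence, normalized by that head's total mass over the context."""
    total = attn.sum(axis=-1)                                   # (L, H)
    feats = [(attn[:, :, s:e].sum(axis=-1)
              / np.clip(total, 1e-12, None)).ravel()            # (L*H,)
             for s, e in spans]
    return np.stack(feats)                                      # (n, L*H)

def probe(feats, w, b):
    """Linear logistic probe: one relevance score per sentence."""
    return 1.0 / (1.0 + np.exp(-(feats @ w + b)))
```

Because the features are normalized per head, the per-sentence masses for any head sum to 1 over a context whose sentences tile the input, which keeps scores comparable across contexts of different lengths.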
Experimental Findings
- Datasets & Evaluation – Experiments are conducted on LongBench (English and Chinese subsets), focusing on QA tasks and query‑conditioned summarization (QMSum). Downstream models include GPT‑3.5‑Turbo and Qwen‑2.5‑7B‑Instruct. Baselines span metric‑based methods (LLMLingua‑1, LongLLMLingua, Selective‑Context), data‑driven methods (LLMLingua‑2, CPC), and a Raw‑Attention heuristic that uses attention weights directly, without probing.
- Performance – Under a 2000‑token constraint, Sentinel with the 0.5B proxy achieves up to 5× compression while remaining competitive with 7B‑scale baselines on QA‑F1 and ROUGE‑L. For example, on English LongBench, Sentinel reaches 37.73% overall versus 39.0% for LongLLMLingua (7B) and 42.6% for CPC (7B). On Chinese LongBench, Sentinel attains 46.16% overall, outperforming most baselines despite being trained only on English QA.
- Cross‑Scale Consistency – Applying the same probing pipeline with a 1.5B proxy yields only marginal gains, confirming that the decoded relevance signals are robust across model sizes. Experiments with different model families (Qwen‑2.5, Qwen‑3, LLaMA‑3) likewise show consistent behavior, indicating that certain attention heads encode query‑context alignment across architectures.
- Interpretability – Analysis of probe weights reveals that a handful of heads in the middle layers (e.g., layer 7, head 12) receive the highest coefficients, aligning with prior mechanistic studies that identified such heads as “retrieval‑oriented”. This provides a transparent link between the probing signal and known model internals.
- Efficiency – Sentinel requires only a single forward pass of a small proxy model plus a linear probe, incurring negligible overhead compared to full‑generation baselines. No compression‑model training is needed beyond the lightweight probe (≈10K parameters).
Implications and Future Directions
Sentinel demonstrates that a model’s own attention dynamics can serve as a reliable, interpretable proxy for context relevance, eliminating the need for external importance metrics or heavyweight supervised compression models. This opens several avenues:
- Fine‑Grained Token‑Level Compression – Extending the aggregation from sentence to token granularity could enable even higher compression ratios for extremely long contexts.
- Non‑Linear Probes – While linear probes preserve interpretability, modestly deeper probes (e.g., small MLPs) might capture more complex, distributed evidence patterns, especially for multi‑hop reasoning.
- Multi‑Modal Extensions – The same probing principle could be applied to vision‑language models, where attention over image patches may indicate visual evidence relevance.
- Joint Compression‑Generation Optimization – Integrating the probe’s relevance scores directly into the downstream generation process (e.g., as an attention bias) could further improve answer fidelity while maintaining compression.
- Robustness to Noisy Retrieval – Investigating how Sentinel behaves when the retrieved set contains irrelevant or adversarial passages would solidify its applicability in real‑world retrieval pipelines.
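As a concrete instance of the non-linear probe direction, the linear readout could be swapped for a one-hidden-layer MLP over the same (n, L·H) features. This is a hypothetical sketch, not anything the paper evaluates; the hidden size and activation are arbitrary choices, and some per-head interpretability is traded away.

```python
import numpy as np

def mlp_probe(feats, W1, b1, w2, b2):
    """Small MLP probe over (n, L*H) sentence features:
    one ReLU hidden layer followed by a sigmoid readout.
    feats: (n, d); W1: (d, hidden); b1: (hidden,); w2: (hidden,); b2: scalar."""
    h = np.maximum(0.0, feats @ W1 + b1)        # (n, hidden)
    return 1.0 / (1.0 + np.exp(-(h @ w2 + b2))) # (n,) scores in [0, 1]
```

Training such a probe on the same weak QA supervision would be a direct test of whether multi-hop evidence patterns are linearly decodable or genuinely distributed.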
In summary, Sentinel offers a conceptually simple yet powerful method for context compression: by decoding the very attention patterns that a frozen LLM uses to understand a query, it achieves strong compression, cross‑lingual generalization, and interpretability with minimal computational cost. This work bridges mechanistic interpretability research and practical system design, suggesting that future RAG systems can become both more efficient and more transparent by listening to the models themselves.