Context Dependence and Reliability in Autoregressive Language Models
Large language models (LLMs) generate outputs by utilizing extensive context, which often includes redundant information from prompts, retrieved passages, and interaction history. In critical applications, it is vital to identify which context elements actually influence the output, as standard explanation methods struggle with redundancy and overlapping context. Minor changes in input can lead to unpredictable shifts in attribution scores, undermining interpretability and raising concerns about risks like prompt injection. This work addresses the challenge of distinguishing essential context elements from correlated ones. We introduce RISE (Redundancy-Insensitive Scoring of Explanation), a method that quantifies the unique influence of each input relative to others, minimizing the impact of redundancies and providing clearer, stable attributions. Experiments demonstrate that RISE offers more robust explanations than traditional methods, emphasizing the importance of conditional information for trustworthy LLM explanations and monitoring.
💡 Research Summary
The paper tackles a pressing problem in the interpretability of large language models (LLMs): when these models are fed long, structured contexts that contain many overlapping or duplicated pieces of information, existing explanation techniques (attention visualizations, gradient saliency, token‑masking perturbations, etc.) often misattribute importance. Small changes such as re‑ordering, paraphrasing, or adding redundant copies can cause attribution scores to swing dramatically even though the model’s output remains unchanged. This instability undermines trust and opens the door to attacks like prompt injection, where an adversary hides malicious instructions among legitimate context.
To address this, the authors propose a principled, information‑theoretic approach called RISE (Redundancy‑Insensitive Scoring of Explanation). The core idea is to evaluate each “context unit” (e.g., system instruction, user query, retrieved document chunk, dialogue turn, tool output, memory entry) not in isolation but conditionally on all other units. They define Conditional Unique Dependence (CUD) as the conditional mutual information between a unit \(C_i\) and the explanation target (the next‑token distribution \(\hat{T}\) or a generated span) given the rest of the context \(C_{\setminus i}\):

\[
\mathrm{CUD}(C_i) = I\bigl(C_i;\, \hat{T} \mid C_{\setminus i}\bigr)
\]

Under this definition, a redundant unit scores near zero: conditioned on its duplicates in \(C_{\setminus i}\), it carries no additional information about the output.
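In practice, a conditional score of this kind can be approximated with leave‑one‑out ablations: measure how much the model's next‑token distribution shifts when one unit is removed while all remaining units stay fixed. The sketch below is illustrative only; `toy_model`, `cud_score`, and the KL‑divergence proxy for conditional dependence are our assumptions, not the paper's actual estimator, and a real implementation would query an LLM's softmax rather than a toy function.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def toy_model(units):
    """Hypothetical stand-in for an LLM's next-token distribution.
    It depends only on the *distinct* facts in the context, so a
    duplicated unit changes nothing -- mimicking redundancy."""
    distinct = set(units)
    p_yes = min(0.9, 0.2 + 0.35 * len(distinct))
    return [p_yes, 1.0 - p_yes]

def cud_score(model, units, i):
    """Leave-one-out proxy for CUD(C_i): the KL divergence between
    the next-token distribution with the full context and with unit
    i ablated, holding the remaining units fixed."""
    ablated = units[:i] + units[i + 1:]
    return kl_divergence(model(units), model(ablated))

units = ["fact_A", "fact_A", "fact_B"]  # "fact_A" appears twice
print(cud_score(toy_model, units, 0))   # duplicated unit: ~0
print(cud_score(toy_model, units, 2))   # unique unit: > 0
```

Because the score conditions on the remaining context, each copy of the duplicated unit receives a near‑zero score, whereas a marginal (unconditional) ablation would credit both copies equally.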