Diagnosing Retrieval Bias Under Multiple In-Context Knowledge Updates in Large Language Models
LLMs are widely used in knowledge-intensive tasks where the same fact may be revised multiple times within context. Unlike prior work focusing on one-shot updates or single conflicts, multi-update scenarios contain multiple historically valid versions that compete at retrieval, yet remain underexplored. This challenge resembles the AB-AC interference paradigm in cognitive psychology: when the same cue A is successively associated with B and C, the old and new associations compete during retrieval, leading to bias. Inspired by this, we introduce a Dynamic Knowledge Instance (DKI) evaluation framework, modeling multiple updates to the same fact as a cue paired with a sequence of updated values, and we assess models via endpoint probing of the earliest (initial) and latest (current) states. Across diverse LLMs, we observe that retrieval bias intensifies as updates accumulate: earliest-state accuracy stays high while latest-state accuracy drops substantially. Diagnostic analyses of attention, hidden-state similarity, and output logits further reveal that these signals become flatter and weakly discriminative on errors, providing little stable basis for identifying the latest update. Finally, cognitively inspired heuristic intervention strategies yield only modest gains and do not eliminate the bias. Our results reveal a persistent challenge in tracking and following knowledge updates in long contexts.
💡 Research Summary
The paper investigates a previously under‑explored failure mode of large language models (LLMs): when a single factual cue is updated multiple times within a single context, the model tends to retrieve older versions of the fact rather than the most recent one. This phenomenon, which the authors term “retrieval bias,” mirrors the classic AB‑AC interference effect from cognitive psychology, in which a cue (A) is first learned with one associate (B) and later with another (C), and the two associations then compete during recall. By adapting this paradigm, the authors create a controlled evaluation framework called Dynamic Knowledge Instance (DKI). A DKI consists of a cue (e.g., “President of Italy”) paired with a sequence of values representing successive updates (V₁, V₂, …, V_T). The framework supports two types of queries: (1) an “earliest‑state” query that asks for V₁, and (2) a “latest‑state” query that asks for V_T. The difference in accuracy between these two queries, called the Early‑Latest Accuracy Gap (ELAG), quantifies retrieval bias as the number of updates T grows.
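The ELAG metric described above can be sketched in a few lines. This is an illustrative reconstruction from the definition in the summary, not the authors' released code; the result-record field names are assumptions.

```python
# Sketch of the Early-Latest Accuracy Gap (ELAG): the difference between
# earliest-state and latest-state query accuracy over a set of DKIs.
# Field names ('earliest_correct', 'latest_correct') are illustrative.

def elag(results):
    """results: list of dicts with boolean 'earliest_correct' and
    'latest_correct' flags, one per evaluated DKI."""
    n = len(results)
    acc_earliest = sum(r["earliest_correct"] for r in results) / n
    acc_latest = sum(r["latest_correct"] for r in results) / n
    # Retrieval bias: earliest-state recall exceeds latest-state recall,
    # so a larger (positive) ELAG means stronger bias toward old values.
    return acc_earliest - acc_latest

# Toy example: 10 DKIs where the earliest value is always recovered
# but the latest one is recovered only 3 times out of 10.
toy = [{"earliest_correct": True, "latest_correct": i < 3} for i in range(10)]
print(elag(toy))  # roughly 0.7 (earliest accuracy 1.0 minus latest 0.3)
```

A widening ELAG as T grows is exactly the signature the paper reports.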
Two families of DKIs are constructed. Synthetic DKIs use random English words for cues and values, eliminating interference from the model’s parametric world knowledge and allowing precise control over T (1, 3, 5, 7). Real‑world DKIs are derived from the EvolvEBench dataset, which contains temporally evolving factual attributes (e.g., corporate CEOs, political leaders). Both families are evaluated under identical prompting conditions across a broad suite of publicly available LLMs, including GPT‑3.5, GPT‑4, Claude‑2, LLaMA‑2 (7B/13B), Mistral‑7B, and others. Context windows are capped at 4 k tokens, and each configuration is tested on 1,000 randomly sampled DKIs.
The empirical findings are striking. As T increases, the accuracy for the latest‑state query drops dramatically (often below 30 % for T = 7), while the earliest‑state accuracy remains high (80‑90 %). Consequently, ELAG widens linearly with the number of updates. Larger models tend to preserve the earliest fact even more strongly, suggesting that parametric memory is robust but contextual updating is fragile. To understand why, the authors probe three internal signals during answer generation: (i) attention weights from the answer token to each candidate value, (ii) hidden‑state similarity (cosine similarity between the answer position’s hidden vector and the mean hidden vector of each candidate), and (iii) output logits and softmax probabilities for each candidate. In correct cases, attention and similarity scores show clear peaks for the correct (latest) value, especially in middle‑to‑upper layers. In error cases, however, these scores flatten: attention distributes almost uniformly across all candidates, hidden‑state similarity collapses to near‑zero variance, and logits for the latest value lose dominance, yielding near‑uniform probability distributions. This flattening indicates that the model lacks a stable, discriminative internal cue for the most recent update.
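The "flattening" of internal signals on error cases can be quantified with a simple statistic. Normalized entropy over the per-candidate scores is one natural choice, shown below as an assumption; the paper's exact diagnostic statistic may differ.

```python
# Sketch of a flatness diagnostic for candidate-value scores (attention
# weights, similarities, or softmax probabilities). Normalized entropy
# near 1.0 means a near-uniform, non-discriminative distribution; this
# is an illustrative statistic, not necessarily the paper's.
import math

def normalized_entropy(scores):
    """Entropy of a candidate-score distribution, scaled to [0, 1];
    1.0 corresponds to a perfectly uniform (flat) distribution."""
    total = sum(scores)
    probs = [s / total for s in scores]
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))

# Correct case: a clear peak on the latest candidate value.
peaked = [0.05, 0.05, 0.05, 0.85]
# Error case: mass spread almost uniformly across all candidates.
flat = [0.24, 0.26, 0.25, 0.25]

print(normalized_entropy(peaked))  # well below 1.0
print(normalized_entropy(flat))    # very close to 1.0
```

Under this lens, the paper's finding is that error cases look like `flat`: no candidate, including the latest value, stands out in attention, hidden-state similarity, or logits.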
Motivated by memory‑enhancement strategies from cognitive psychology, the authors experiment with four prompt‑level interventions, among them (1) a meta‑prompt that explicitly asks the model to prioritize the most recent information and (2) insertion of a special marker token into the context. As the abstract reports, these heuristics yield only modest gains and do not eliminate the retrieval bias.
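The meta-prompt intervention can be sketched as a simple prompt wrapper. The wording below is a hypothetical reconstruction; the paper's exact prompts are not reproduced here.

```python
# Illustrative sketch of a recency meta-prompt intervention: prepend an
# instruction telling the model to prefer the most recent update. The
# instruction text is an assumption, not the authors' prompt.

def with_recency_meta_prompt(context, query):
    meta = ("The context below may revise the same fact several times. "
            "When answering, always use the MOST RECENT update and "
            "ignore earlier, superseded values.")
    return f"{meta}\n\nContext: {context}\n\nQuestion: {query}"

prompt = with_recency_meta_prompt(
    "Update 1: the CEO is Alice. Update 2: the CEO is Bob.",
    "Who is the current CEO?",
)
print(prompt)
```

The paper's finding is that even with such explicit instructions, latest-state accuracy improves only modestly, which is what motivates the call for architectural rather than prompt-level fixes.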
The paper concludes by positioning retrieval bias as a fundamental challenge for any application that requires up‑to‑date factual recall from long contexts (e.g., news summarization, regulatory compliance, dynamic knowledge bases). The authors argue that future work should explore architectural augmentations such as external key‑value memory stores, dynamic parameter updates, or training objectives that explicitly enforce temporal consistency (e.g., contrastive loss between successive updates). They also recommend practical mitigations, such as post‑processing verification against external databases or designing system pipelines that separate “historical” and “current” knowledge streams.
In sum, the study provides a rigorous, psychologically grounded methodology for quantifying multi‑update retrieval bias, offers detailed diagnostics of why LLMs falter, and demonstrates that while prompt‑based heuristics help, they are insufficient. The findings call for more principled solutions that give LLMs a reliable mechanism for tracking and retrieving the most recent version of a fact amidst a sea of older, competing traces.