Can LLMs Reconcile Knowledge Conflicts in Counterfactual Reasoning?
Large Language Models have been shown to contain extensive world knowledge in their parameters, enabling impressive performance on many knowledge-intensive tasks. However, when deployed in novel settings, LLMs often encounter situations where they must integrate parametric knowledge with new or unfamiliar information. In this work, we explore whether LLMs can combine in-context knowledge with their parametric knowledge through the lens of counterfactual reasoning. Through synthetic and real-world experiments on multi-hop reasoning problems, we show that LLMs generally struggle with counterfactual reasoning, often falling back exclusively on their parametric knowledge. Moreover, we show that simple post-hoc fine-tuning can struggle to instill counterfactual reasoning ability, often degrading stored parametric knowledge in the process. Ultimately, our work reveals important limitations of current LLMs' ability to re-purpose parametric knowledge in novel settings.
💡 Research Summary
The paper investigates whether large language models (LLMs) can reconcile conflicts between the factual knowledge stored in their parameters and novel, counterfactual premises supplied at inference time. The authors frame this problem as “counterfactual reasoning” that requires two complementary abilities: (1) Contextual Override – the model must temporarily suppress a stored fact (e.g., “Paris is in France”) and accept a contradictory premise (“Paris is in Italy”); and (2) Selective Retrieval – the model must still retrieve and use related stored knowledge (e.g., the link between “Eiffel Tower” and “Paris”) to answer a multi‑hop query correctly.
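The two abilities can be made concrete with a toy two-hop example. The sketch below is illustrative only (the facts, relation names, and lookup scheme are assumptions for this summary, not the paper's implementation): a contextual premise overrides exactly one stored edge, while the other stored edge must still be retrieved to complete the chain.

```python
# Hypothetical sketch of the two abilities required for counterfactual
# multi-hop reasoning: (1) Contextual Override replaces one stored fact
# with an in-context premise; (2) Selective Retrieval still uses the
# untouched parametric facts to complete the chain.
# All facts and names here are illustrative, not the paper's data.

PARAMETRIC_KB = {
    ("Eiffel Tower", "located_in"): "Paris",   # must survive the override
    ("Paris", "in_country"): "France",         # fact to be overridden
}

def answer_two_hop(entity, context_premise=None):
    """Answer 'which country is `entity` in?', optionally under a
    counterfactual premise of the form (head, relation, tail)."""
    kb = dict(PARAMETRIC_KB)
    if context_premise is not None:
        head, relation, tail = context_premise
        kb[(head, relation)] = tail        # contextual override
    city = kb[(entity, "located_in")]      # selective retrieval (hop 1)
    return kb[(city, "in_country")]        # hop 2 uses the overridden edge

print(answer_two_hop("Eiffel Tower"))                                    # France
print(answer_two_hop("Eiffel Tower", ("Paris", "in_country", "Italy")))  # Italy
```

The failure modes the paper describes correspond to ignoring the override (always answering "France") or over-applying it (losing the "Eiffel Tower → Paris" hop entirely).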
To probe these abilities, the authors construct a suite of benchmarks that combine synthetic graph‑based tasks (extending prior “Grokked” reasoning benchmarks) with real‑world causal reasoning scenarios. They categorize counterfactual contexts into four scenarios: (1) Reinforcing prior knowledge (the premise repeats an existing edge), (2) Adding new information (the premise introduces a missing edge), (3) Contradicting prior knowledge (the premise directly opposes a stored edge), and (4) Irrelevant information (the premise is unrelated to the query).
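A minimal generator for the four scenario types might look like the following. This is a sketch under stated assumptions (toy edge dictionary, naming convention with `_new`/`_alt` suffixes); the benchmark's actual construction is not specified here.

```python
import random

# Illustrative generator for the four counterfactual-context scenarios
# over a toy directed graph stored as (head, relation) -> tail.
# The edge set and premise format are assumptions for this sketch.

def make_context(stored_edges, scenario, rng=None):
    """Return one (head, relation, tail) premise for a given scenario."""
    rng = rng or random.Random()
    (head, rel), tail = rng.choice(sorted(stored_edges.items()))
    if scenario == "reinforcing":      # repeats an existing edge
        return (head, rel, tail)
    if scenario == "adding":           # introduces a missing edge
        return (head + "_new", rel, tail)
    if scenario == "contradicting":    # directly opposes a stored edge
        return (head, rel, tail + "_alt")
    if scenario == "irrelevant":       # unrelated to any query edge
        return ("unrelated_node", "unrelated_rel", "other_node")
    raise ValueError(f"unknown scenario: {scenario}")

edges = {("A", "r1"): "B", ("B", "r2"): "C"}
for s in ("reinforcing", "adding", "contradicting", "irrelevant"):
    print(s, make_context(edges, s))
```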
Three strategies are evaluated: standard direct queries, chain‑of‑thought (CoT) prompting, and fine‑tuning on a small set of counterfactual examples with CoT explanations (≈160 examples). The models tested include GPT‑4o, the reasoning‑oriented GPT‑5 (referred to as “Thinking”), a fine‑tuned version of GPT‑4o, and LLaMA 3.1 8B (LLaMA results appear in the appendix).
Key empirical findings:
- Scenario 1 (Reinforcement) – All models achieve very high accuracy (90–100%). When the context aligns with stored knowledge, LLMs reliably use their parametric facts.
- Scenario 2 (Adding new information) – Non‑fine‑tuned models hover around 60–75% accuracy, while fine‑tuning boosts performance to ≈90%. The improvement shows that modest supervised exposure can help the model learn to incorporate new edges, but the gain is limited.
- Scenario 3 (Contradiction) – Accuracy collapses to the random baseline (≈50%). Fine‑tuning yields only marginal gains. This reveals a fundamental difficulty: strong priors encoded in the weights are hard to override, leading to two failure modes identified by the authors: (a) context‑ignoring (the model defaults to stored facts) and (b) context‑overfitting (the model follows the premise but loses essential intermediate links).
- Scenario 4 (Irrelevant context) – Most models maintain high performance; GPT‑5 reaches near‑perfect scores, while fine‑tuned GPT‑4o improves over the standard baseline. LLaMA 3.1, however, shows a performance drop after fine‑tuning, suggesting that model size and architecture influence how fine‑tuning interacts with counterfactual reasoning.
CoT prompting provides modest benefits but does not resolve the core conflict between context and parametric knowledge. The authors argue that the observed bias toward stored facts is amplified by post‑training alignment procedures (RLHF, factuality rewards) that explicitly reward consistency with memorized knowledge and penalize speculative deviations. Such alignment makes it harder for a model to accept counterfactual premises, even when they are plausible.
To isolate the effect of alignment, the authors conduct controlled synthetic experiments with small transformers trained from scratch on randomly generated directed graphs. Even without any alignment pressure, models still exhibit the same two failure modes when presented with contradictory premises, indicating that the limitation is rooted in the architecture’s inability to dynamically modify its internal knowledge graph on the fly.
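The controlled setup described above can be sketched as follows. Graph sizes, the functional (one-successor-per-node) structure, and the prompt format are assumptions made for this illustration; the paper's exact configuration may differ.

```python
import random

# Minimal sketch of the synthetic setup: sample a random directed graph,
# then build a two-hop query paired with a premise that contradicts the
# first hop. The correct counterfactual answer follows the premise for
# hop 1 and the stored graph for hop 2.

def sample_successors(n_nodes=20, seed=0):
    """Give each node one random successor (a functional directed graph)."""
    rng = random.Random(seed)
    return {v: rng.randrange(n_nodes) for v in range(n_nodes)}

def make_contradiction_example(succ, node, new_mid):
    """Two-hop query whose premise `node -> new_mid` opposes succ[node]."""
    prompt = f"Given {node}->{new_mid}, what is two hops from {node}?"
    answer = succ[new_mid]   # second hop still uses the stored edges
    return prompt, answer

succ = sample_successors()
prompt, answer = make_contradiction_example(succ, node=3, new_mid=7)
print(prompt, "->", answer)
```

Under this framing, context-ignoring corresponds to answering `succ[succ[3]]` (using the stored first hop), while context-overfitting corresponds to answering `7` itself (following the premise but dropping the second hop).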
The paper concludes that while modern LLMs excel at memorizing facts and performing multi‑hop reasoning over them, they lack robust mechanisms for on‑the‑fly knowledge modification. Simple fine‑tuning or prompting tricks are insufficient to endow them with reliable counterfactual reasoning. The authors suggest future directions such as integrating explicit memory modules, designing meta‑learning objectives that balance trust in parametric versus contextual sources, and revisiting alignment objectives to preserve flexibility under hypothetical scenarios.
Overall, the work provides a systematic diagnosis of a critical blind spot in current LLM capabilities and highlights the need for new modeling and training paradigms that can seamlessly blend stored knowledge with novel, potentially conflicting information.