LLMs as Repositories of Factual Knowledge: Limitations and Solutions

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv paper.

LLMs’ sources of knowledge are data snapshots containing factual information about entities, collected at different timestamps and from different media types (e.g., wikis, social media). Such unstructured knowledge is subject to change as it is updated over time. Equally important are the inconsistencies and inaccuracies that occur across information sources. Consequently, a model’s knowledge about an entity may be perturbed while training over the sequence of snapshots or at inference time, resulting in inconsistent and inaccurate model performance. In this work, we study the appropriateness of Large Language Models (LLMs) as repositories of factual knowledge. We consider twenty-four state-of-the-art LLMs that are either closed-source, partially open-source (weights), or fully open-source (weights and training data). We evaluate their reliability in responding to time-sensitive factual questions, in terms of accuracy and consistency when prompts are perturbed. We further evaluate the effectiveness of state-of-the-art methods for improving LLMs’ accuracy and consistency. We then propose ENtity-Aware Fine-tuning (ENAF), a soft neurosymbolic approach that provides a structured representation of entities during fine-tuning to reduce inconsistencies and improve response stability under prompt variations.


💡 Research Summary

This paper investigates the suitability of large language models (LLMs) as repositories of factual knowledge that changes over time. Recognizing that LLMs are trained on multiple data snapshots collected at different timestamps, the authors argue that such static training leads to outdated or fragmented knowledge, especially for time-sensitive facts. To evaluate this problem, they introduce DyKnow, a dynamic benchmark that continuously refreshes its fact set using real-time Wikidata entries. DyKnow assesses 24 state-of-the-art LLMs (including GPT-2, GPT-3, T5, GPT-J, Bloom, Flan-T5, GPT-4, Llama-2, Falcon, Vicuna, Mistral, Mixtral, and ChatGPT) on both accuracy (whether the model returns the current attribute) and consistency (stability under prompt perturbations). Prompt perturbations are of two types: subject perturbations (different lexicalizations of the same entity) and property perturbations (different phrasings of the queried property). Results show that even the best models achieve only ~80% accuracy, with 10-30% of answers being outdated, and consistency degrades markedly under subject variations.
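The accuracy/consistency evaluation described above can be sketched roughly as follows. Note that the prompts, answers, and the pairwise-agreement metric below are illustrative assumptions, not the paper's exact protocol.

```python
def consistency(answers):
    """Fraction of answer pairs that agree across perturbed prompts."""
    norm = [a.strip().lower() for a in answers]
    pairs = [(i, j) for i in range(len(norm)) for j in range(i + 1, len(norm))]
    if not pairs:  # a single prompt is trivially consistent
        return 1.0
    agree = sum(norm[i] == norm[j] for i, j in pairs)
    return agree / len(pairs)

# Subject perturbations: different lexicalizations of the same entity.
subject_prompts = [
    "Who is the CEO of Twitter?",
    "Who is the CEO of Twitter, Inc.?",
    "Who is the chief executive of Twitter?",
]
# A model that drifts across phrasings scores below 1.0:
answers = ["Linda Yaccarino", "Linda Yaccarino", "Jack Dorsey"]
print(consistency(answers))  # 1 agreeing pair of 3 -> 0.333...
```

Accuracy would be computed separately, by comparing each answer against the current Wikidata value for the queried property.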

The study then compares four knowledge-editing methods (the parameter-modifying ROME and MEMIT, and the parameter-preserving SERAC and IKE) as well as Retrieval-Augmented Generation (RAG). Parameter-modifying edits quickly update specific facts but can cause unintended ripple effects; parameter-preserving methods avoid such side effects but have lower edit success rates. RAG improves up-to-date performance by fetching external information at inference time, yet its reliance on a separate retrieval component introduces its own error sources.
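As a rough illustration of the RAG pipeline mentioned above, the sketch below builds a retrieval-augmented prompt from a toy corpus. The word-overlap retriever, corpus, and prompt template are assumptions for illustration; a real system would use a proper retriever (e.g., BM25 or dense embeddings) and pass the prompt to an LLM.

```python
import re

def tokens(text):
    """Lowercase word set, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(question, corpus, k=2):
    """Toy lexical retriever: rank passages by word overlap with the question."""
    q = tokens(question)
    return sorted(corpus, key=lambda p: -len(q & tokens(p)))[:k]

def build_rag_prompt(question, corpus):
    """Prepend retrieved passages so the model answers from current evidence."""
    context = "\n".join(retrieve(question, corpus))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

corpus = [
    "As of 2023, the CEO of Twitter is Linda Yaccarino.",
    "Twitter was founded in 2006.",
    "Paris, capital of France.",
]
print(build_rag_prompt("Who is the CEO of Twitter?", corpus))
```

The failure mode noted in the summary is visible even here: if the retriever surfaces an irrelevant or stale passage, the generation step inherits that error regardless of the model's own parameters.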

To address the root cause of inconsistency, the authors propose ENtity-Aware Fine-tuning (ENAF), a soft neurosymbolic approach that injects structured entity representations (unique IDs and named-entity tags) into the fine-tuning data. By mapping all lexical variants of an entity to a single symbolic token, ENAF markedly improves response stability: consistency under subject perturbations rises by ~15 percentage points, and overall accuracy improves by 3-5 percentage points. ENAF outperforms RAG on consistency while preserving the original model parameters.
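The entity-canonicalization idea behind ENAF can be sketched as follows: every surface form of an entity is annotated with one symbolic ID and a named-entity tag before fine-tuning. The bracket markup, the Wikidata-style QIDs, and the variant index below are hypothetical illustrations, not the paper's actual annotation scheme.

```python
# Hypothetical index from lexical variants to (canonical ID, NE tag).
ENTITY_INDEX = {
    "twitter": ("Q918", "ORG"),
    "twitter, inc.": ("Q918", "ORG"),
    "x corp.": ("Q918", "ORG"),
}

def annotate(text):
    """Wrap the first known entity mention with its canonical ID and NE tag."""
    # Try longer variants first so "twitter, inc." wins over "twitter".
    for variant in sorted(ENTITY_INDEX, key=len, reverse=True):
        qid, tag = ENTITY_INDEX[variant]
        idx = text.lower().find(variant)
        if idx != -1:
            mention = text[idx:idx + len(variant)]
            return text[:idx] + f"[{tag}:{qid} {mention}]" + text[idx + len(variant):]
    return text

# Different lexicalizations collapse onto the same symbolic token Q918:
print(annotate("Who is the CEO of Twitter, Inc.?"))
print(annotate("Who is the CEO of Twitter?"))
```

Because both perturbed prompts now carry the same ID, the fine-tuned model can learn to tie its answer to the entity rather than to a particular surface form, which is the mechanism behind the consistency gains reported above.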

Finally, the paper discusses how the ENAF‑DyKnow framework can be extended to multimodal and speech‑driven LLMs, where utterance variability makes robust entity handling essential. The authors conclude that dynamic benchmarking combined with entity‑centric neurosymbolic fine‑tuning is a promising path toward making LLMs reliable, up‑to‑date knowledge repositories.

