Are LLMs Really Not Knowledgeable? Mining the Submerged Knowledge in LLMs' Memory

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Large language models (LLMs) have shown promise as parametric knowledge bases, but often underperform on question answering (QA) tasks due to hallucinations and uncertainty. While prior work attributes these failures to knowledge gaps in the model’s parameters, we uncover a complementary phenomenon: LLMs frequently retain correct knowledge even when generating incorrect or “unsure” answers. By analyzing the token-level output distributions, we find that correct answers often appear among high-probability candidates, despite not being selected. Motivated by this, we propose Hits@k, a novel metric to evaluate latent knowledge retention independent of answer surface form. Our experiments reveal that LLMs possess significantly more factual knowledge than is reflected by standard QA accuracy. Building on these insights, we further examine the prevailing few-shot QA paradigm. We find that prompting strategies which allow “unsure” outputs can inadvertently suppress correct answers by discouraging low-confidence generation. We design a set of quantitative experiments to measure this suppression effect, offering practical guidance for future prompt and decoding design in knowledge-intensive tasks.


💡 Research Summary

The paper investigates a previously under‑explored aspect of large language models (LLMs): even when they produce an incorrect answer or explicitly say they are “unsure,” the correct answer often resides high in the model’s internal probability distribution. This phenomenon, which the authors term a “storage‑expression gap,” indicates that LLMs frequently store factual knowledge in their parameters but fail to express it during generation.

To measure this latent knowledge, the authors introduce Hits@k, defined as the proportion of questions for which the correct answer appears among the top‑k tokens in the model’s logits. Because modern LLM vocabularies contain on the order of 128 k tokens, a relatively small k (e.g., 5, 50, 100) already captures a large share of stored facts while remaining computationally tractable.
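As a minimal sketch of how Hits@k can be computed from a single next-token distribution (the function name and the toy 10-token logits below are illustrative, not taken from the paper):

```python
def hits_at_k(logits, answer_token_id, k):
    """True if the answer's token is among the k highest-scoring tokens."""
    top_k = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    return answer_token_id in top_k

# Toy 10-token vocabulary: token 3 is the greedy pick, token 5 ranks second.
logits = [0.1, 2.3, -1.0, 4.2, 0.0, 3.1, -0.5, 1.7, 0.9, 2.8]
print(hits_at_k(logits, answer_token_id=5, k=1))  # False: Hits@1 (accuracy) misses it
print(hits_at_k(logits, answer_token_id=5, k=5))  # True: Hits@5 recovers it
```

Aggregating this boolean over a question set yields the Hits@k proportion; with a real model one would rank the full ~128k-entry vocabulary the same way.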

Extensive experiments are conducted across a spectrum of models (LLaMA 2/3, Qwen 2, Mistral) ranging from 1.5 B to 70 B parameters, and across three datasets: an open‑domain knowledge base (DBpedia), and two domain‑specific corpora (IMDB for movies and Goodreads for books). The key findings are:

  1. Large storage, modest surface accuracy – For LLaMA 3‑8B on DBpedia‑Head, Hits@1 (standard accuracy) is only 17.2 % while Hits@5 jumps to 57.9 % and Hits@100 exceeds 90 %. This pattern repeats across models and domains, confirming that conventional accuracy severely underestimates the amount of factual knowledge encoded.

  2. Model size is not a reliable predictor of latent knowledge – While increasing parameter count improves accuracy, Hits@k does not consistently rise with size. For example, LLaMA 2‑13B and LLaMA 2‑70B achieve similar Hits@k scores, as do LLaMA 3‑8B and LLaMA 3‑70B.

  3. Newer architectures outperform older ones – Even at comparable scales, newer models (LLaMA 3, Qwen 2) achieve substantially higher Hits@k than their predecessors, suggesting that training data updates and architectural refinements enhance the ability to retrieve stored facts.

  4. Domain and popularity effects – Open‑domain data yields higher Hits@k than specialized domains, likely because specialized facts are less represented in pre‑training corpora. Within each domain, more popular entities (the “head” of the frequency distribution) enjoy higher Hits@k, but the gap is far smaller than the accuracy gap, indicating that even low‑popularity facts can be stored but remain hard to express.

  5. Prompting with “unsure” can suppress correct knowledge – Allowing the model to answer “I don’t know” or similar tokens leads the decoder to down‑weight low‑confidence correct tokens, effectively masking them in the top‑k. When the authors filter out “unsure” tokens during decoding or switch to greedy decoding (temperature = 0), many previously hidden correct answers re‑appear in the top‑k. This demonstrates a trade‑off: encouraging cautious responses may reduce hallucinations but also diminish the model’s capacity to surface latent knowledge.
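The filtering intervention described in point 5 can be sketched as masking "unsure"-style tokens out of the distribution before ranking, so that a suppressed factual candidate re-enters the top-k (the token ids and logit values here are hypothetical, not from the paper):

```python
def filtered_top_k(logits, unsure_token_ids, k):
    """Rank tokens by logit after removing 'unsure'-style tokens, letting
    low-confidence factual candidates re-enter the top-k."""
    candidates = [i for i in range(len(logits)) if i not in unsure_token_ids]
    return sorted(candidates, key=lambda i: logits[i], reverse=True)[:k]

# Hypothetical distribution where an "unsure" token (id 0) dominates.
logits = [5.0, 1.2, 3.9, 0.4, 2.7]
print(sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:2])  # [0, 2]
print(filtered_top_k(logits, unsure_token_ids={0}, k=2))                      # [2, 4]
```

In practice the "unsure" set would cover the first tokens of phrases like "I don't know", which is what the paper's decoding-time filter targets.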

The authors argue that Hits@k should become a standard evaluation metric for knowledge‑intensive tasks, complementing traditional accuracy. It reveals a hidden reservoir of facts that can be tapped by more sophisticated decoding strategies (e.g., top‑k sampling with re‑ranking, logit‑based answer extraction) or by redesigning prompts to avoid unnecessary “unsure” pathways.
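For instance, "logit-based answer extraction" could take the form of scoring a closed candidate set by next-token probability rather than relying on free-form generation (a minimal illustration under that assumption, not the paper's implementation):

```python
import math

def rerank_candidates(logits, candidate_token_ids):
    """Softmax the full distribution, then rank only the given candidate
    tokens by probability instead of taking the free-generation argmax."""
    z = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(l - z) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return sorted(candidate_token_ids, key=lambda i: probs[i], reverse=True)

logits = [0.2, 3.1, 1.5, 2.8]
print(rerank_candidates(logits, [0, 2, 3]))  # [3, 2, 0]
```

Because softmax is monotone in the logits, this ranking matches a direct logit ranking; the softmax step matters only when probabilities themselves feed a downstream re-ranker.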

In conclusion, LLMs act as substantial parametric knowledge bases, but current generation pipelines often fail to extract the stored information. By quantifying this storage‑expression gap with Hits@k and exposing the impact of prompting choices, the paper provides both a diagnostic tool and practical guidance for future model development, prompt engineering, and decoding algorithm design aimed at unlocking the full factual potential of large language models.

