Benchmarking Large Language Models for Knowledge Graph Validation

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the [Original Paper Viewer] below or the original arXiv source.

Knowledge Graphs (KGs) store structured factual knowledge by linking entities through relationships, and many applications depend on them. Because these applications rely on the KG’s factual accuracy, verifying its facts is essential, yet challenging. Manual verification by experts is ideal but impractical at scale. Automated methods show promise but are not ready for real-world KGs. Large Language Models (LLMs) offer potential through their semantic understanding and broad knowledge, yet their suitability and effectiveness for KG fact validation remain largely unexplored. In this paper, we introduce FactCheck, a benchmark designed to evaluate LLMs for KG fact validation across three key dimensions: (1) the LLMs’ internal knowledge; (2) external evidence via Retrieval-Augmented Generation (RAG); and (3) aggregated knowledge through a multi-model consensus strategy. We evaluated open-source and commercial LLMs on three diverse real-world KGs. FactCheck also includes a RAG dataset of more than two million documents tailored for KG fact validation. Additionally, we offer an interactive exploration platform for analyzing verification decisions. The experimental analyses demonstrate that while LLMs yield promising results, they are not yet sufficiently stable and reliable for real-world KG validation scenarios. Integrating external evidence through RAG yields fluctuating performance, providing inconsistent improvements over more streamlined approaches, and at higher computational cost. Similarly, strategies based on multi-model consensus do not consistently outperform individual models, underscoring the lack of a one-size-fits-all solution. These findings further emphasize the need for a benchmark like FactCheck to systematically evaluate and drive progress on this difficult yet crucial task.


💡 Research Summary

The paper introduces FactCheck, a comprehensive benchmark designed to evaluate large language models (LLMs) on the task of knowledge‑graph (KG) fact validation. KG triples (<subject, predicate, object>) are transformed into natural‑language statements, and the benchmark measures how accurately LLMs can judge their truthfulness under three distinct conditions: (1) using only the model’s internal knowledge, (2) augmenting the model with external evidence via Retrieval‑Augmented Generation (RAG), and (3) aggregating predictions from multiple LLMs through a majority‑vote consensus. FactCheck draws on three real‑world KG datasets—FactBench, YAGO, and DBpedia—providing a total of 13,530 annotated triples that span everyday facts to domain‑specific knowledge. In addition, the authors compile a large‑scale RAG corpus of over two million documents, derived from simulated Google Search Engine Result Pages (SERPs), and expose a mock API to ensure reproducibility of retrieval pipelines. A web interface (https://factcheck.dei.unipd.it/) allows users to explore each verification step, inspect evidence, and conduct error‑type analysis.
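The triple-to-statement step described above can be sketched as follows. The naive `verbalize` heuristic and the prompt wording are illustrative assumptions, not the paper’s actual templates:

```python
# Sketch (not the authors' code): verbalizing a KG triple
# <subject, predicate, object> into a natural-language statement
# and wrapping it in a truthfulness-judgment prompt.

def verbalize(subject: str, predicate: str, obj: str) -> str:
    """Turn a triple into a plain sentence via a naive predicate split."""
    # Split camelCase/underscored predicates into lowercase words.
    words = predicate.replace("_", " ")
    words = "".join(" " + c.lower() if c.isupper() else c for c in words).strip()
    return f"{subject} {words} {obj}."

def make_prompt(statement: str) -> str:
    """Wrap the statement in a binary fact-validation prompt."""
    return (
        "Is the following statement true or false? "
        "Answer with exactly one word, True or False.\n"
        f"Statement: {statement}"
    )

statement = verbalize("Barack Obama", "bornIn", "Honolulu")
print(statement)  # Barack Obama born in Honolulu.
print(make_prompt(statement))
```

In the internal-knowledge condition, this prompt would be sent to the model as-is, with no supporting evidence attached.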

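The RAG condition can be sketched as a two-step retrieve-then-prompt pipeline. The `mock_search` stub below stands in for the benchmark’s mock SERP API, whose actual interface is not specified here; the corpus entry is an illustrative assumption:

```python
# Sketch of a minimal RAG step: retrieve evidence snippets for a
# statement, then build an evidence-grounded validation prompt.
# `mock_search` is a stand-in for the benchmark's mock search API.

def mock_search(query: str, k: int = 3) -> list[str]:
    """Return up to k evidence snippets for the query (hypothetical data)."""
    corpus = {
        "Barack Obama born in Honolulu": [
            "Barack Obama was born in Honolulu, Hawaii, in 1961.",
            "Honolulu is the capital of the U.S. state of Hawaii.",
        ],
    }
    return corpus.get(query, [])[:k]

def rag_prompt(statement: str) -> str:
    """Assemble a prompt that grounds the judgment in retrieved snippets."""
    snippets = mock_search(statement.rstrip("."))
    evidence = "\n".join(f"- {s}" for s in snippets)
    return (
        "Using only the evidence below, answer True or False.\n"
        f"Evidence:\n{evidence}\n"
        f"Statement: {statement}"
    )

print(rag_prompt("Barack Obama born in Honolulu."))
```

The extra retrieval call and the longer evidence-laden prompt are where the 2–3× token and latency overhead reported in the findings below comes from.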
The experimental setup includes mid‑sized open‑source models (7–9 B parameters, such as LLaMA‑7B and Mistral‑8B) and commercial offerings (GPT‑4, Claude). The results reveal three key findings. First, when relying solely on internal knowledge, LLMs achieve only modest accuracy (~58 %) and struggle especially with recent or low‑frequency facts, reflecting knowledge‑cutoff and hallucination issues. Second, adding RAG can boost performance by 5–12 % for some models, but the gains are highly inconsistent: errors in query formulation, noisy retrieval results, or over‑reliance on retrieved snippets often cause performance drops, and the RAG pipeline incurs 2–3× higher token usage and inference latency, raising cost concerns. Third, a simple multi‑model majority vote yields only marginal improvements (a 1–3 % accuracy increase) and sometimes amplifies shared misconceptions, because many models are trained on overlapping corpora. The ensemble also multiplies computational overhead, limiting practical deployment.
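The multi-model majority vote amounts to a simple aggregation over per-model verdicts; the model names and the tie-breaking rule below are illustrative assumptions:

```python
# Sketch of the majority-vote consensus strategy: each model emits a
# True/False verdict for a statement, and the ensemble keeps the
# majority answer. Ties default to False here (flagging the fact for
# review rather than accepting it) -- an assumed policy, not the paper's.
from collections import Counter

def majority_vote(votes: dict[str, bool]) -> bool:
    """Aggregate per-model True/False verdicts into one verdict."""
    counts = Counter(votes.values())
    return counts[True] > counts[False]

votes = {"model-a": True, "model-b": True, "model-c": False}
print(majority_vote(votes))  # True
```

Note how a shared misconception defeats this scheme: if two of the three models were trained on the same erroneous source, their agreement outvotes the one correct model.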

The authors conduct a thorough qualitative error analysis, categorizing failures into (a) knowledge gaps, (b) hallucinated reasoning, (c) evidence misinterpretation, and (d) consensus conflicts. Based on these observations, they outline future research directions: more sophisticated evidence selection and confidence estimation for RAG, dynamic prompting and chain‑of‑thought techniques to better fuse internal and external knowledge, hierarchical or cost‑aware ensemble strategies, and continuous expansion of FactCheck to cover additional languages, domains, and temporal dynamics.

In summary, FactCheck fills a gap in the evaluation landscape by providing the first benchmark that systematically assesses LLMs for KG fact checking across internal, external, and aggregated knowledge dimensions. The study demonstrates that current LLMs, even when enhanced with retrieval or ensembles, are not yet reliable enough for large‑scale, real‑world KG validation, highlighting the need for improved retrieval mechanisms, reasoning frameworks, and efficient ensemble designs. FactCheck’s datasets, mock APIs, and interactive tools constitute a valuable infrastructure for the community to track progress and drive advances toward trustworthy, scalable KG verification.

