Scaling Knowledge Graph Construction through Synthetic Data Generation and Distillation
Document-level knowledge graph (KG) construction faces a fundamental scaling challenge: existing methods either rely on expensive large language models (LLMs), making them economically nonviable for large-scale corpora, or employ smaller models that produce incomplete and inconsistent graphs. We find that this limitation stems not from model capabilities but from insufficient training on high-quality document-level KG data. To address this gap, we introduce SynthKG, a multi-step data synthesis pipeline that generates high-quality document-KG pairs through systematic chunking, decontextualization, and structured extraction using LLMs. By fine-tuning a smaller LLM on synthesized document-KG pairs, we streamline the multi-step process into a single-step KG generation approach called Distill-SynthKG. Furthermore, we repurpose existing question-answering datasets to construct KG evaluation datasets and introduce new evaluation metrics. Using KGs produced by Distill-SynthKG, we also design a novel graph-based retrieval framework for RAG. Experimental results demonstrate that Distill-SynthKG not only surpasses all baseline models in KG quality (including models up to eight times larger) but also consistently improves performance on retrieval and question-answering tasks. Additionally, our proposed graph retrieval framework outperforms all KG-retrieval methods across multiple benchmark datasets.
💡 Research Summary
The paper tackles the scalability bottleneck of document‑level knowledge graph (KG) construction, which has traditionally relied on either costly large language models (LLMs) such as GPT‑4o or on smaller models that produce incomplete, inconsistent graphs. The authors argue that the root cause is not model capacity but the lack of high‑quality, document‑level KG training data. To fill this gap they introduce a two‑stage data synthesis pipeline called SynthKG and a distilled single‑step KG generator named Distill‑SynthKG.
SynthKG first splits a long document into semantically coherent chunks. Because processing each chunk in isolation can break entity continuity, a “decontextualization” step rewrites each chunk using the preceding chunk as context, normalizing entity mentions (e.g., turning pronouns or shortened names into the full canonical form). This yields self‑contained chunks with consistent entity naming. The second stage prompts an LLM to extract (1) entities and their types, and then (2) propositions—natural‑language sentences that describe each relation—along with traditional (head, relation, tail) triples. The proposition acts as an intermediate chain‑of‑thought, both improving extraction accuracy and providing a fine‑grained retrieval unit.
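The staged flow above can be sketched as a small pipeline around a stubbed LLM call. This is a minimal illustration, not the paper's implementation: the prompts are invented, the chunker here naively splits on sentence counts (whereas SynthKG uses semantically coherent chunks), and `llm` stands in for any instruction-following model.

```python
from typing import Callable, Dict, List


def synthkg_pipeline(document: str,
                     llm: Callable[[str], str],
                     chunk_size: int = 3) -> List[Dict]:
    """Sketch of the SynthKG stages: chunk -> decontextualize -> extract.

    `llm` is a stand-in for any instruction-following model; the prompt
    templates below are illustrative, not the paper's exact prompts.
    """
    # 1. Split into chunks (naive sentence-count split for illustration).
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    chunks = [". ".join(sentences[i:i + chunk_size]) + "."
              for i in range(0, len(sentences), chunk_size)]

    records = []
    previous = ""
    for chunk in chunks:
        # 2. Decontextualize: rewrite the chunk so entity mentions are
        #    self-contained, using the preceding chunk as context.
        rewritten = llm(
            f"Context:\n{previous}\n\nRewrite the passage so every entity "
            f"mention uses its full canonical name:\n{chunk}"
        )
        # 3a. Extract entities and their types.
        entities = llm(f"List the entities and their types in:\n{rewritten}")
        # 3b. Extract propositions, then (head, relation, tail) triples.
        propositions = llm(f"Write one sentence per relation in:\n{rewritten}")
        triples = llm(f"Convert to (head, relation, tail) triples:\n{propositions}")
        records.append({
            "chunk": rewritten,
            "entities": entities,
            "propositions": propositions,
            "triples": triples,
        })
        previous = chunk
    return records
```

In a real run, `llm` would wrap calls to the teacher model (e.g., GPT‑4o), and each record would become one document‑KG training pair.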
Running SynthKG on a large corpus produces on the order of 100 K synthetic document‑KG pairs. These high‑quality pairs are then used to fine‑tune a relatively small LLM (e.g., a 7B‑parameter model). The resulting model, Distill‑SynthKG, can ingest an entire document in a single forward pass and output a complete KG, effectively “distilling” the multi‑step pipeline into model parameters. Experiments show that Distill‑SynthKG matches or exceeds the KG quality of much larger baselines (up to eight times the parameter count) while reducing inference cost dramatically.
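The distillation step amounts to converting the synthesized pairs into ordinary supervised fine-tuning examples: the document becomes the prompt and the full KG becomes the target completion. A minimal sketch, with an illustrative prompt and field names (the paper's exact serialization format is not specified here):

```python
import json
from typing import Dict, List, Tuple


def build_distillation_examples(
        doc_kg_pairs: List[Tuple[str, Dict]]) -> List[Dict[str, str]]:
    """Turn synthesized (document, KG) pairs into single-step fine-tuning
    examples. The prompt wording and JSON schema are illustrative."""
    examples = []
    for doc, kg in doc_kg_pairs:
        examples.append({
            "prompt": ("Extract a knowledge graph (entities, propositions, "
                       "and triples) from the following document:\n" + doc),
            # Serialize the whole KG as the completion the student must emit.
            "completion": json.dumps(kg),
        })
    return examples


# Toy pair; a real corpus would yield ~100K such examples.
pairs = [("Marie Curie won the Nobel Prize.",
          {"triples": [["Marie Curie", "won", "Nobel Prize"]]})]
examples = build_distillation_examples(pairs)
```

The student model (e.g., a 7B LLM) is then fine-tuned on these prompt/completion pairs, collapsing the multi-step teacher pipeline into one forward pass per document.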
For evaluation, the authors repurpose existing multi‑hop QA datasets (MuSiQue, 2WikiMultiHopQA, HotpotQA). They generate proxy triplets from question‑answer pairs using GPT‑4o, ensuring that the final answer appears as head, relation, or tail in at least one triplet. Two novel metrics—semantic‑similarity coverage and keyword‑based coverage—are introduced to quantify how well a KG captures the facts needed to answer the questions. These metrics correlate strongly with downstream QA and retrieval performance, providing a practical way to assess ontology‑free, document‑level KGs.
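The keyword-based coverage metric can be illustrated with a toy implementation: a proxy triplet counts as covered if its key terms all appear somewhere in the predicted KG. This is a deliberately simplified sketch; the paper's actual matching rules differ, and its semantic-similarity variant uses embedding cosine similarity rather than exact string matching.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def keyword_coverage(proxy_triplets: List[Triple],
                     predicted_triples: List[Triple]) -> float:
    """Toy keyword-based coverage: the fraction of proxy triplets whose
    head and tail keywords all occur in the predicted KG's triple text."""
    # Flatten the predicted KG into one lowercase bag of text.
    kg_text = " ".join(" ".join(t) for t in predicted_triples).lower()
    if not proxy_triplets:
        return 0.0
    covered = 0
    for head, _relation, tail in proxy_triplets:
        keywords = (head + " " + tail).lower().split()
        if all(k in kg_text for k in keywords):
            covered += 1
    return covered / len(proxy_triplets)


proxy = [("Paris", "capital of", "France"),
         ("Berlin", "capital of", "Germany")]
predicted = [("Paris", "is capital of", "France")]
score = keyword_coverage(proxy, predicted)  # covers 1 of 2 proxy triplets
```

A KG that captures the answer-bearing facts of a QA dataset scores high on such coverage, which is why the metric tracks downstream QA performance.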
Building on the generated KGs, the paper proposes a graph‑based retrieval framework for Retrieval‑Augmented Generation (RAG). The system first retrieves relevant propositions, then expands through graph traversal to collect related triples and text chunks. This progressive retrieval outperforms traditional text‑only retrieval and prior KG‑retrieval methods across the three benchmark QA datasets, delivering higher recall and precision, especially for multi‑hop reasoning queries.
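The progressive proposition-to-triple-to-chunk expansion can be sketched as follows. This is a hypothetical simplification: scoring here is plain term overlap, whereas the actual framework would use dense retrieval, and the link structures (`prop_to_triples`, `triple_to_chunk`) are illustrative names for the KG's provenance mappings.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def graph_retrieve(query_terms: List[str],
                   propositions: List[str],
                   prop_to_triples: Dict[str, List[Triple]],
                   triple_to_chunk: Dict[Triple, str],
                   top_k: int = 2):
    """Sketch of progressive graph retrieval: score propositions against
    the query, then traverse graph links to gather triples and chunks."""
    # 1. Rank propositions by simple term overlap (stand-in for embeddings).
    query = set(t.lower() for t in query_terms)

    def score(prop: str) -> int:
        return len(set(prop.lower().split()) & query)

    ranked = sorted(propositions, key=score, reverse=True)[:top_k]

    # 2. Expand: collect the triples linked to each retrieved proposition.
    triples = [t for p in ranked for t in prop_to_triples.get(p, [])]

    # 3. Expand again: collect the source text chunks of those triples.
    chunks: List[str] = []
    for t in triples:
        chunk = triple_to_chunk.get(t)
        if chunk and chunk not in chunks:
            chunks.append(chunk)
    return ranked, triples, chunks
```

For a multi-hop question, each hop's retrieved triples surface new entities whose propositions can seed the next round, which is where the graph structure pays off over flat text retrieval.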
In summary, the contributions are:
- SynthKG – a systematic pipeline that creates high‑coverage, ontology‑free document‑level KG training data.
- Distill‑SynthKG – a compact LLM fine‑tuned on synthetic data that achieves large‑model KG quality in a single inference step.
- A method to construct large‑scale KG evaluation datasets from multi‑hop QA and two new coverage metrics.
- A graph‑centric retrieval architecture that leverages the distilled KGs to boost RAG performance.
- Empirical evidence that data‑centric synthesis and distillation can replace brute‑force scaling of model size, enabling cost‑effective, high‑quality KG construction for massive corpora.