FLAME: Empowering Frozen LLMs for Knowledge Graph Completion
Traditional knowledge graph completion (KGC) methods rely solely on structural information and struggle with sparsity, while Large Language Models (LLMs) address these limitations through rich world knowledge and strong context modeling. Fine-tuning LLMs is effective but costly, while non-fine-tuned LLMs are efficient but suboptimal. To address this trade-off, we propose **FLAME**, a framework that extracts context-aware hidden states from intermediate layers of frozen LLMs to train data-efficient KGC classifiers. We bridge LLM-KG semantic gaps via subgraph-based entity descriptions and employ sliced mutual information (SMI) to quantify task-relevant information in representations. Experiments demonstrate that FLAME achieves up to a 47% improvement over non-fine-tuned LLM baselines and, to our knowledge, is the first to match fine-tuned performance with 188× memory efficiency and a 26.11× speedup.
💡 Research Summary
The paper tackles a fundamental limitation of traditional knowledge graph completion (KGC) methods, which rely exclusively on graph structure and therefore suffer from sparsity. Large language models (LLMs) possess rich world knowledge and strong contextual reasoning, making them attractive for KGC, but existing approaches either fine‑tune the LLM—incurring heavy computational and memory costs—or use prompt‑based inference without fine‑tuning, which often yields sub‑optimal performance due to misalignment with the task.
FLAME (Frozen LLM‑Based Knowledge Graph Completion) proposes a middle ground: keep the LLM frozen, extract context‑aware hidden states from its intermediate layers, and train only a lightweight classifier (e.g., a two‑layer MLP) on these representations. The framework consists of three tightly coupled components.
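As a toy sketch of this recipe (not the authors' code: the dimensions are hypothetical and synthetic Gaussian vectors stand in for frozen LLM hidden states), a two-layer MLP with a sigmoid output can be trained on fixed representations with plain gradient descent:

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_hid = 64, 32          # hypothetical sizes; FLAME uses the LLM's hidden dim
W1 = rng.normal(0, 0.1, (d_in, d_hid))
b1 = np.zeros(d_hid)
W2 = rng.normal(0, 0.1, (d_hid, 1))
b2 = np.zeros(1)

def forward(X):
    h = np.maximum(X @ W1 + b1, 0.0)          # ReLU hidden layer
    p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))  # sigmoid probability of "valid triple"
    return h, p.ravel()

# Stand-in for frozen representations: two separable Gaussian clusters.
X = np.vstack([rng.normal(+1, 1, (100, d_in)), rng.normal(-1, 1, (100, d_in))])
y = np.array([1.0] * 100 + [0.0] * 100)

lr = 0.1
for _ in range(200):                          # plain gradient descent on BCE loss
    h, p = forward(X)
    g = (p - y)[:, None] / len(y)             # dLoss/dlogits for BCE + sigmoid
    gh = (g @ W2.T) * (h > 0)                 # backprop through ReLU
    W2 -= lr * (h.T @ g); b2 -= lr * g.sum(0)
    W1 -= lr * (X.T @ gh); b1 -= lr * gh.sum(0)

acc = ((forward(X)[1] > 0.5) == (y > 0.5)).mean()
```

Only `W1, b1, W2, b2` are ever updated; the LLM that produced `X` never receives a gradient, which is where the memory and speed savings come from.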
1. Subgraph‑Based Entity Description Generator:
- Many KG entities lack detailed textual descriptions, and raw entity names are ambiguous for LLMs.
- For each entity, FLAME retrieves its one‑hop subgraph and creates a structured verbalization by concatenating verbalized triples.
- To make the text more “model‑friendly,” an in‑context learning prompt is sent to a separate LLM (GPT‑3.5) that rewrites the structured text into a natural‑language narrative while preserving the subgraph constraints.
- These narratives serve as bridge sentences that align the symbolic KG space with the semantic space the frozen LLM was trained on, reducing hallucination and improving the relevance of extracted representations.
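The verbalization step above can be sketched as follows; the template strings and function names are illustrative assumptions, not the paper's exact prompts, and the rewriting LLM call is omitted:

```python
# Verbalize an entity's one-hop subgraph by concatenating its triples as
# short clauses, then build the in-context rewriting prompt that would be
# sent to a separate LLM (e.g., GPT-3.5) for a fluent narrative.
def verbalize_subgraph(entity, triples):
    """triples: list of (head, relation, tail) from the one-hop subgraph."""
    clauses = [f"{h} {r.replace('_', ' ')} {t}" for h, r, t in triples]
    return f"Facts about {entity}: " + "; ".join(clauses) + "."

def build_rewrite_prompt(entity, structured_text):
    """An ICL-style instruction asking the rewriter to stay within the facts."""
    return (
        "Rewrite the following facts as a natural-language description. "
        "Do not add information beyond the given facts.\n"
        f"Entity: {entity}\nFacts: {structured_text}\nDescription:"
    )

subgraph = [("Paris", "capital_of", "France"), ("Paris", "located_in", "Europe")]
text = verbalize_subgraph("Paris", subgraph)
prompt = build_rewrite_prompt("Paris", text)
```

The instruction "do not add information beyond the given facts" mirrors the paper's goal of preserving the subgraph constraints and limiting hallucination.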
2. Task‑Specific Prompt Design and Representation Extraction:
- For triple classification, relation prediction, and entity prediction, FLAME defines four prompt templates (PT1–PT4) that embed the entity names and, optionally, the generated descriptions.
- Positive triples are sampled from the KG, while negatives are generated by standard corruption.
- The frozen LLM processes each prompt, and the hidden state of the last token from a chosen intermediate layer (or a set of layers) is recorded as the representation `Rep_l_j(s)`.
- Because the LLM is causal, the last token’s representation captures the model’s accumulated reasoning over the entire prompt, making it a rich source of relational knowledge without requiring generation.
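The selection step can be isolated as a small function. The shapes below follow the Hugging Face convention of a per-layer tuple of `(batch, seq, dim)` hidden states with right padding; that convention, and the toy arrays, are assumptions for illustration:

```python
import numpy as np

def last_token_representation(hidden_states, attention_mask, layer):
    """Pick the hidden state of the LAST real (non-pad) token at one layer.

    hidden_states: tuple of (batch, seq, dim) arrays, one per layer.
    attention_mask: (batch, seq) of 0/1, right-padding assumed.
    """
    states = hidden_states[layer]
    last_idx = attention_mask.sum(axis=1) - 1            # final real token per row
    return states[np.arange(states.shape[0]), last_idx]  # (batch, dim)

# Toy data: 3 "layers", a batch of 2 prompts with different lengths.
hs = tuple(np.arange(2 * 5 * 4).reshape(2, 5, 4) + 100 * l for l in range(3))
mask = np.array([[1, 1, 1, 0, 0],
                 [1, 1, 1, 1, 1]])
rep = last_token_representation(hs, mask, layer=1)
```

Because the model is causal, this last-token vector summarizes the whole prompt, so no text generation is needed before classification.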
3. Information‑Theoretic Validation via Sliced Mutual Information (SMI):
- To answer whether frozen representations retain sufficient task‑relevant information, the authors compute SMI between the high‑dimensional hidden vectors X and the discrete KGC labels Y.
- SMI approximates the average mutual information over random one‑dimensional projections of X, which scales better to high dimensions than classic MI estimators.
- Monte‑Carlo sampling with the KSG estimator yields a scalar SMI value for each layer. Higher SMI indicates that the layer’s representations are more informative for the downstream KGC task.
- Empirically, FLAME’s SMI scores approach those of fine‑tuned baselines, confirming that the frozen LLM already encodes the needed relational cues.
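A minimal SMI sketch is shown below. For simplicity a binned plug-in MI estimator stands in for the KSG estimator the paper uses (that substitution is mine), and the data are synthetic; the point is only the slicing structure — average 1-D MI over random unit projections:

```python
import numpy as np

def mi_1d_discrete(x, y, bins=16):
    """Plug-in MI (in nats) between a binned 1-D variable and discrete labels."""
    xb = np.digitize(x, np.histogram_bin_edges(x, bins=bins))
    joint = np.zeros((xb.max() + 1, y.max() + 1))
    for xi, yi in zip(xb, y):
        joint[xi, yi] += 1
    joint /= joint.sum()
    px, py = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
    nz = joint > 0                       # avoid log(0) on empty cells
    return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())

def sliced_mi(X, y, n_proj=100, seed=0):
    """Monte-Carlo average of 1-D MI over random unit-norm projections of X."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_proj):
        theta = rng.normal(size=X.shape[1])
        theta /= np.linalg.norm(theta)
        total += mi_1d_discrete(X @ theta, y)
    return total / n_proj

rng = np.random.default_rng(1)
y = np.repeat([0, 1], 200)
X_info = rng.normal(0, 1, (400, 8)) + 2.0 * y[:, None]   # label-dependent
X_noise = rng.normal(0, 1, (400, 8))                     # label-independent
```

Representations that actually carry the labels (`X_info`) yield a higher SMI than pure noise, which is exactly the comparison FLAME makes across layers and against fine-tuned baselines.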
Experimental Findings
The authors evaluate FLAME on six benchmark datasets across three tasks: FB13, WN11, FB15K‑237N, WN18RR, and UMLS for triple classification; YAGO3‑10 for relation prediction; and WN18RR for entity prediction. Key experimental settings include:
- Using only 3k training triples (10k for UMLS), a small fraction of the full training set.
- LLaMA‑7B as the primary backbone; additional experiments with Mistral‑7B, Gemma‑7B, and Qwen2.5‑7B demonstrate architecture‑agnostic performance.
- Baselines comprise classic embedding models (TransE, DistMult, ComplEx, RotatE), KG‑BERT, KG‑T5, LLaMA‑ICL, and the strong fine‑tuned KG‑LLAMA.
Results show:
- FLAME (with GPT‑generated descriptions) matches or exceeds KG‑LLAMA’s accuracy on most datasets while using only ~1 % of the training data.
- Relative to non‑fine‑tuned LLM baselines, FLAME improves accuracy by up to 47 % (e.g., on FB13).
- In relation prediction, with just 0.6 % of training triples, FLAME retains 97 % of the performance of a fully fine‑tuned model.
- For entity prediction, the classifier’s confidence scores enable effective ranking, yielding Hits@1 comparable to fine‑tuned methods.
- SMI analysis reveals a 34 % boost when using the subgraph‑derived descriptions, indicating better semantic alignment.
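The confidence-based ranking used for entity prediction can be sketched in a few lines; the scoring function here is a hypothetical stand-in for the trained classifier:

```python
def hits_at_k(score_fn, head, relation, candidates, gold, k=1):
    """Rank candidate tails by classifier confidence; check Hits@k for gold."""
    ranked = sorted(candidates, key=lambda t: score_fn(head, relation, t),
                    reverse=True)
    return gold in ranked[:k]

# Toy scorer: pretend the triple classifier is confident about one known fact.
def toy_score(h, r, t):
    return 0.9 if (h, r, t) == ("Paris", "capital_of", "France") else 0.1

hit = hits_at_k(toy_score, "Paris", "capital_of",
                ["Germany", "France", "Spain"], gold="France")
```

Scoring every candidate with the same binary classifier turns triple classification into a ranking signal, which is how the frozen-representation setup extends to entity prediction.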
Efficiency Gains
Because only the classifier’s parameters are updated, FLAME reduces GPU memory consumption during training by a factor of 188× compared with parameter‑efficient fine‑tuning (e.g., LoRA). Training and inference speed improve by 26.11×. The overhead of extracting representations is modest and independent of model size, making the approach especially attractive for very large LLMs (e.g., 70 B) where fine‑tuning is prohibitive.
Ablation and Robustness
Ablation studies confirm that (i) entity descriptions are crucial—removing them drops performance by 4–6 %; (ii) the choice of intermediate layer matters, with layers near the middle of the model offering the best trade‑off between semantic richness and task relevance; (iii) simple classifiers (MLP) suffice, while SVM and logistic regression perform slightly worse, underscoring that the extracted representations are already highly discriminative.
Conclusions
FLAME demonstrates that frozen LLMs, when paired with carefully crafted subgraph‑based entity narratives and a principled information‑theoretic assessment, can achieve fine‑tuned‑level KGC performance with dramatically lower computational resources. The framework bridges the semantic gap between symbolic KG data and the pretrained LLM’s language space, validates the latent task‑relevant knowledge via SMI, and offers a scalable, data‑efficient solution for real‑world KG completion tasks.