Efficient and Verifiable Multi-Hop Question Answering Without Large Language Models
📝 Abstract
Multi-hop question answering over knowledge graphs remains computationally challenging due to the combinatorial explosion of possible reasoning paths. Recent approaches rely on expensive Large Language Model (LLM) inference for both entity linking and path ranking, limiting their practical deployment. Additionally, LLM-generated answers often lack verifiable grounding in structured knowledge. We present two complementary hybrid algorithms that address both efficiency and verifiability: (1) LLM-Guided Planning that uses a single LLM call to predict relation sequences executed via breadth-first search, achieving near-perfect accuracy (micro-F1 > 0.90) while ensuring all answers are grounded in the knowledge graph, and (2) Embedding-Guided Neural Search that eliminates LLM calls entirely by fusing text and graph embeddings through a lightweight 6.7M-parameter edge scorer, achieving over 100 times speedup with competitive accuracy. Through knowledge distillation, we compress planning capability into a 4B-parameter model that matches large-model performance at zero API cost. Evaluation on MetaQA demonstrates that grounded reasoning consistently outperforms ungrounded generation, with structured planning proving more transferable than direct answer generation. Our results show that verifiable multi-hop reasoning does not require massive models at inference time, but rather the right architectural inductive biases combining symbolic structure with learned representations.
📄 Content
Knowledge graphs (KGs) have emerged as powerful structures for representing domain-specific, structured information that supports verifiable, multi-hop reasoning. Meanwhile, large language models (LLMs) trained on vast web-scale corpora have achieved impressive fluency and generalization across a wide range of tasks. However, this same generality and end-to-end training on heterogeneous internet data introduce significant limitations.
Because LLMs are trained on broad, uncurated internet text, they cannot guarantee factual accuracy in specialized or domain-specific contexts. Even when complete contextual information is supplied through prompting or retrieval augmentation, the outputs remain not fully reliable. LLMs are well known to produce hallucinations, i.e., outputs that are fluent and plausible yet factually incorrect or unsupported [1]–[3]. Recent studies emphasize that hallucination remains one of the most pressing barriers to deploying LLMs in high-stakes scenarios.
In critical domains such as medicine, cybersecurity, and financial analysis, it is insufficient for an LLM to generate fluent answers; responses must be grounded in verifiable evidence, and the reasoning process should be transparent and traceable. Knowledge graphs provide a strong foundation for such grounding by encoding structured domain knowledge and enabling explicit traversal and reasoning over relations [4]–[6]. Recent work has also explored constructing KGs directly from unstructured text using LLM-based reasoning [7]. Once such graphs are built, the ability to efficiently traverse them and answer natural language queries becomes crucial for practical deployment. Consequently, integrating KGs with LLMs has emerged as a promising direction for reducing hallucinations, improving factual consistency, and enhancing the interpretability of model outputs.
Several recent methods, such as Plan-on-Graph [8] and Think-on-Graph [9], have explored using LLMs to plan reasoning paths directly over knowledge graphs. In these frameworks, the model performs iterative reasoning by generating traversal steps, one hop at a time, through repeated LLM invocations. While these methods improve interpretability and allow for natural language-based reasoning, they face serious scalability bottlenecks: each traversal step requires a separate LLM call, making the overall process computationally expensive, latency-prone, and often infeasible for large graphs or production-scale systems. As the number of nodes and relations grows, the combinatorial explosion in possible traversal sequences further amplifies these costs.
To address these challenges, we propose and experiment with two complementary strategies for efficient KG exploration. First, we investigate whether modern LLMs, with improved planning and reasoning capabilities, can identify the optimal sequence of relations to explore in a single or limited number of planning steps, thereby reducing repeated calls during multi-hop reasoning. Second, we explore a multimodal approach that combines text embeddings and graph embeddings to traverse the graph entirely without additional LLM invocations. This embedding-guided traversal provides a lightweight and scalable mechanism to approximate reasoning paths efficiently, offering significant improvements in speed and cost over approaches that depend on expensive large-model inference.
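As a minimal illustration of the second, embedding-guided strategy, the sketch below scores candidate relations at each hop by cosine similarity between a question embedding and relation embeddings. This is only a stand-in: the actual system fuses text and graph embeddings through a learned 6.7M-parameter edge scorer, and all vectors and relation names here are toy placeholders.

```python
import numpy as np

# Toy stand-in for a learned edge scorer: rank each candidate relation by
# cosine similarity to the question embedding. The vectors are random
# placeholders, not real text or graph embeddings.
rng = np.random.default_rng(0)
relations = ["acted_in", "directed_by", "written_by", "release_year"]
rel_emb = {r: rng.normal(size=16) for r in relations}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pick_relation(question_emb, candidates):
    """Greedy hop selection: take the candidate relation the scorer ranks highest."""
    return max(candidates, key=lambda r: cosine(question_emb, rel_emb[r]))

# Pretend the question embedding lies close to the "directed_by" relation.
q = rel_emb["directed_by"] + rng.normal(scale=0.05, size=16)
print(pick_relation(q, relations))  # → directed_by
```

Because no LLM is invoked per hop, selecting the next edge costs only a few vector operations, which is where the reported speedup over call-per-hop planners comes from.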
We argue that combining embedding-guided exploration with LLM planning offers a promising step toward trustworthy, efficient, and explainable reasoning systems, paving the way for safer deployment of LLMs in high-stakes domains.
Knowledge graphs (KGs) represent structured world knowledge as collections of triples (h, r, t), where h and t denote entities and r represents the semantic relation between them [10], [11]. By capturing explicit relational information, KGs provide a foundation for interpretable reasoning across complex domains. Reasoning over KGs can be performed through path traversal or logical inference, where multi-hop reasoning involves connecting distant entities via intermediate relations to answer complex queries [12], [13]. For instance, to answer “Which movies were directed by someone who also acted in The Terminal?” one must traverse multiple relations such as acted in and directed by. Multi-hop reasoning thus enables compositional question answering and explainable inference by revealing the intermediate reasoning chain.
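The traversal just described can be sketched with a toy triple store: a predicted relation sequence is executed as a breadth-first frontier expansion, so every returned answer is, by construction, grounded in the graph. The triples and relation names below are illustrative inventions, not drawn from MetaQA.

```python
from collections import defaultdict

# Toy KG as (h, r, t) triples.
triples = [
    ("Tom Hanks", "acted_in", "The Terminal"),
    ("Catherine Zeta-Jones", "acted_in", "The Terminal"),
    ("Steven Spielberg", "directed", "The Terminal"),
    ("Tom Hanks", "directed", "That Thing You Do!"),
    ("Tom Hanks", "directed", "Larry Crowne"),
]

# Index edges by (entity, relation); also add inverse relations so a plan
# can traverse an edge in either direction.
adj = defaultdict(set)
for h, r, t in triples:
    adj[(h, r)].add(t)
    adj[(t, r + "_inv")].add(h)

def execute_plan(seeds, relation_sequence):
    """Execute a relation sequence hop by hop as a breadth-first frontier expansion."""
    frontier = set(seeds)
    for r in relation_sequence:
        frontier = {t for e in frontier for t in adj[(e, r)]}
    return frontier

# "Which movies were directed by someone who also acted in The Terminal?"
# Plan: The Terminal --acted_in_inv--> actors --directed--> movies
answers = execute_plan({"The Terminal"}, ["acted_in_inv", "directed"])
print(sorted(answers))  # → ['Larry Crowne', 'That Thing You Do!']
```

Note that the only learned component is the relation sequence itself; once it is fixed, execution is a deterministic graph operation, which is what makes the intermediate reasoning chain inspectable.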
Large language models (LLMs) such as GPT [14], LLaMA [15], and Qwen [16] are pre-trained on massive text corpora drawn from the internet. This broad pretraining grants them strong generalization and fluency but limits factual reliability in specialized or knowledge-intensive domains. Because the underlying corpora are uncurated, these models often exhibit hallucination, producing fluent but factually incorrect or unverifiable statements [1]–[3]. While retrieval-augmented generation (RAG) [17] and prompting with contextual docu