Citation-Grounded Code Comprehension: Preventing LLM Hallucination Through Hybrid Retrieval and Graph-Augmented Context
Large language models have become essential tools for code comprehension, enabling developers to query unfamiliar codebases through natural language interfaces. However, LLM hallucination, the generation of plausible but factually incorrect citations to source code, remains a critical barrier to reliable developer assistance. This paper addresses the challenge of achieving verifiable, citation-grounded code comprehension through hybrid retrieval and lightweight structural reasoning. Our work is grounded in a systematic evaluation across 30 Python repositories with 180 developer queries, comparing retrieval modalities, graph expansion strategies, and citation verification mechanisms. We find that citation accuracy challenges arise from the interplay between sparse lexical matching, dense semantic similarity, and cross-file architectural dependencies. Among these, cross-file evidence discovery is the largest contributor to citation completeness, yet it is largely overlooked because existing systems rely on pure textual similarity without leveraging code structure. We advocate for citation-grounded generation as an architectural principle for code comprehension systems and demonstrate its value by achieving 92 percent citation accuracy with zero hallucinations. Specifically, we develop a hybrid retrieval system combining BM25 sparse matching, BGE dense embeddings, and Neo4j graph expansion via import relationships; it outperforms single-mode baselines by 14 to 18 percentage points while discovering cross-file evidence missed by pure text similarity in 62 percent of architectural queries.
💡 Research Summary
The paper “Citation-Grounded Code Comprehension: Preventing LLM Hallucination Through Hybrid Retrieval and Graph-Augmented Context” addresses a critical reliability issue in AI-assisted software development: the tendency of Large Language Models (LLMs) to hallucinate—that is, generate plausible but factually incorrect citations to source code—when answering developer queries about unfamiliar codebases. This undermines developer trust and productivity, as it can lead to debugging wrong files or following incorrect implementation guidance.
The authors identify the root cause not merely in the LLMs themselves, but in the limitations of current Retrieval-Augmented Generation (RAG) systems for code. These systems typically treat code as flat text, relying solely on textual similarity (either sparse keyword matching like BM25 or dense semantic embeddings) to find relevant evidence. This approach fails to discover cross-file architectural dependencies (e.g., finding an exception class definition in file B when the query retrieves an exception raise in file A), which are crucial for answering a significant portion of real-world comprehension queries. Furthermore, standard RAG lacks enforceable mechanisms to guarantee that the LLM’s citations correspond to the retrieved context.
To solve this, the paper proposes a novel system architecture built on the principle of “citation-grounded generation.” The system integrates three core components:
- Hybrid Retrieval: It combines the strengths of sparse retrieval (BM25) for precise keyword/identifier matching and dense retrieval (BGE embeddings) for capturing semantic similarity. Their scores are fused using optimized weights (α=0.45 for BM25, β=0.55 for dense).
- Graph-Augmented Context Expansion: Recognizing that code has inherent structure, the system pre-processes a codebase to build a graph of import relationships using Neo4j. After the initial hybrid retrieval returns a set of files, the system traverses this graph to discover and boost structurally connected neighbor files, thereby uncovering evidence that pure textual similarity would miss.
- Mechanical Citation Verification: The system requires the LLM to emit citations in a structured, machine-checkable format. Each generated citation is then mechanically verified against the retrieved context before the answer is returned, so citations that do not correspond to retrieved evidence are rejected rather than shown to the developer.
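The first component's weighted score fusion can be sketched as follows. The weights (α=0.45 for BM25, β=0.55 for dense) come from the summary; the min-max normalization step is an assumption, since the text does not say how the two score distributions are placed on a common scale.

```python
def minmax(scores):
    """Min-max normalize a {doc_id: score} map into [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all scores tie
    return {doc: (s - lo) / span for doc, s in scores.items()}

def fuse(bm25_scores, dense_scores, alpha=0.45, beta=0.55):
    """Fuse sparse and dense scores: fused = alpha * bm25 + beta * dense.

    Documents found by only one retriever get 0 for the missing modality.
    Returns (doc_id, fused_score) pairs sorted best-first.
    """
    bm25_n, dense_n = minmax(bm25_scores), minmax(dense_scores)
    docs = set(bm25_n) | set(dense_n)
    fused = {d: alpha * bm25_n.get(d, 0.0) + beta * dense_n.get(d, 0.0)
             for d in docs}
    return sorted(fused.items(), key=lambda kv: -kv[1])
```

A file that ranks mid-pack lexically but very high semantically (or vice versa) can thus outrank a file that only one retriever favors, which is the usual motivation for hybrid fusion.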
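The second component's graph expansion step can be illustrated without a live database as a one-hop walk over an import-adjacency map. In the paper's system this graph lives in Neo4j; the node/relationship labels in the comment below and the `boost` weighting are assumptions for illustration, not the authors' actual schema or parameters.

```python
def expand_via_imports(retrieved, import_graph, boost=0.3):
    """Boost files structurally connected to the initial retrieval set.

    retrieved:    {path: fused_score} from hybrid retrieval
    import_graph: {path: set of paths} undirected import adjacency.
                  (In Neo4j this might be queried with something like
                  MATCH (f:File {path: $p})-[:IMPORTS]-(n:File)
                  RETURN n.path -- labels here are hypothetical.)
    boost:        fraction of a file's score credited to its neighbors
                  (hypothetical value; the paper does not state one).
    """
    expanded = dict(retrieved)
    for path, score in retrieved.items():
        for neighbor in import_graph.get(path, ()):
            # A neighbor (e.g., the file defining an exception that a
            # retrieved file raises) earns a structural bonus even if
            # textual similarity alone never surfaced it.
            expanded[neighbor] = expanded.get(neighbor, 0.0) + boost * score
    return expanded
```

This is how evidence like an exception class defined in file B can enter the context when only file A, which raises it, matched the query text.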
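The third component's mechanical check can be sketched as below. The `path:start-end` citation format and the span-containment rule are hypothetical stand-ins, since this summary does not show the paper's exact format; the point is that verification is a string-and-range check against retrieved context, requiring no model judgment.

```python
import re

# Hypothetical citation syntax "path/to/file.py:10-25"; the paper's
# actual format is not shown in this summary.
CITATION_RE = re.compile(r"([\w./-]+\.py):(\d+)-(\d+)")

def verify_citations(answer, retrieved_spans):
    """Split citations in `answer` into verified and rejected.

    retrieved_spans: {path: (start_line, end_line)} for each retrieved
    chunk. A citation is verified only if its file was retrieved and its
    line range falls inside the retrieved span; a real system would strip
    or regenerate answers containing rejected citations.
    """
    verified, rejected = [], []
    for path, start, end in CITATION_RE.findall(answer):
        start, end = int(start), int(end)
        span = retrieved_spans.get(path)
        if span and span[0] <= start and end <= span[1]:
            verified.append((path, start, end))
        else:
            rejected.append((path, start, end))
    return verified, rejected
```

Because the check is purely mechanical, a citation to a file that was never retrieved can be rejected deterministically, which is what makes a zero-hallucination guarantee enforceable rather than merely encouraged by prompting.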