Hierarchical Knowledge Injection for Improving LLM-based Program Repair

Notice: This research summary and analysis were automatically generated using AI. For absolute accuracy, please refer to the original arXiv source.

Prompting LLMs with bug-related context (e.g., error messages, stack traces) improves automated program repair, but many bugs still remain unresolved. In real-world projects, developers often rely on broader repository and project-level context beyond the local code to resolve such bugs. In this paper, we investigate how automatically extracting and providing such knowledge can improve LLM-based program repair. We propose a layered knowledge injection framework that incrementally augments LLMs with structured context. It starts with the Bug Knowledge Layer, which includes information such as the buggy function and failing tests; expands to the Repository Knowledge Layer, which adds structural dependencies, related files, and commit history; and finally injects the Project Knowledge Layer, which incorporates relevant details from documentation and previously fixed bugs. We evaluate this framework on a dataset of 314 bugs from BugsInPy using two LLMs (Llama 3.3 and GPT-4o-mini), and analyze fix rates across six bug types. By progressively injecting knowledge across layers, our approach achieves a fix rate of 79% (250/314) using Llama 3.3, a significant improvement of 23% over previous work. All bug types show improvement with the addition of repository-level context, while only a subset benefit further from project-level knowledge, highlighting that different bug types require different levels of contextual information for effective repair. We also analyze the remaining unresolved bugs and find that more complex and structurally isolated bugs, such as Program Anomaly and GUI bugs, remain difficult even after injecting all available information. Our results show that layered context injection improves program repair and suggest the need for interactive and adaptive APR systems.


💡 Research Summary

Title: Hierarchical Knowledge Injection for Improving LLM‑based Program Repair

Problem Statement:
Large language models (LLMs) have shown promise for automated program repair (APR) when prompted with bug‑related facts such as error messages or stack traces. However, many bugs remain unsolved because real‑world developers routinely rely on broader repository and project‑level information—cross‑file dependencies, commit history, documentation, and prior issue resolutions—that is absent from typical prompts. Prior work (e.g., Parasaram et al.) demonstrated modest gains by selecting a handful of “facts” but still left a large portion of bugs unaddressed, indicating that a more systematic, hierarchical approach to context provision is needed.

Proposed Solution:
The authors introduce a layered knowledge injection framework that progressively enriches the prompt with three structured knowledge tiers:

  1. Bug Knowledge Layer – Local information directly tied to the failure: buggy function source, failing test, error message, runtime variables, and GitHub issue description. This mirrors the fact‑selection approach but is delivered in a uniform, JSON‑like format.

  2. Repository Knowledge Layer – For bugs not fixed after the first layer, additional repository‑wide context is added: files that co‑changed with the buggy function (co‑commit analysis), structural dependencies (call graph, import graph), and recent commit metadata (author, message, diff summary). This layer captures the broader code‑base topology that developers normally explore.

  3. Project Knowledge Layer – For the remaining unresolved bugs, project‑level artifacts are injected: API documentation excerpts, README sections, design docs, and examples of previously fixed, semantically similar bugs. This provides high‑level intent and historical fix patterns.

The framework is incremental: each layer is only added when the previous one fails, conserving token budget and allowing analysis of which knowledge type benefits which bug category.
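The incremental loop described above can be sketched as follows. Everything here is a toy stand-in, not the authors' implementation: the layer extractors, the `generate_patch` call (which would prompt the LLM), and the pass check are all hypothetical placeholders that only illustrate the control flow of "add a layer, retry, stop on success."

```python
def repair_with_layers(bug, generate_patch, patch_passes_tests, layers):
    """Incrementally enrich the prompt context one knowledge layer at a
    time; stop at the first layer whose generated patch passes the tests."""
    context = {}
    for layer_name, extract in layers:
        context[layer_name] = extract(bug)    # inject this layer's knowledge
        patch = generate_patch(bug, context)  # re-prompt the LLM
        if patch_passes_tests(patch):
            return patch, layer_name          # fixed; later layers are skipped
    return None, None                         # unresolved after all layers

# Toy demo: this bug only resolves once repository-level context is present.
layers = [
    ("bug", lambda b: {"failing_test": b["failing_test"]}),
    ("repository", lambda b: {"co_changed_files": ["utils.py"]}),
    ("project", lambda b: {"similar_fixed_bugs": []}),
]
toy_generate = lambda bug, ctx: set(ctx)          # "patch" = layers seen so far
toy_passes = lambda patch: "repository" in patch  # needs repository knowledge
patch, solved_at = repair_with_layers(
    {"failing_test": "test_parse"}, toy_generate, toy_passes, layers
)
print(solved_at)  # -> repository
```

Because the loop exits early, an easy bug fixed at the Bug Knowledge layer never pays the token cost of the repository or project layers.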

Dataset & Experimental Setup:

  • BugsInPy subset: 314 Python bugs with associated fix commits.
  • Bugs manually labeled using Catolino et al.’s taxonomy into nine categories (Program Anomaly, Network, Configuration, GUI, Performance, Permission/Deprecation, etc.).
  • Two LLMs evaluated: Llama 3.3 (70 B parameters) and GPT‑4o‑mini (parameter count undisclosed).
  • Baselines: (a) Parasaram et al.’s fine‑grained fact selection, and (b) an “all‑at‑once” baseline that injects all context simultaneously.
  • Success metric: a generated patch passes the entire test suite (Fix Rate).
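Under this metric a patch counts as a fix only if the entire suite passes, and the Fix Rate is the floored percentage of fixed bugs. The helper below is an illustrative sketch: the pytest-based pass check is a hypothetical stand-in for however the authors actually run the BugsInPy suites.

```python
import subprocess

def patch_passes_suite(repo_dir):
    """Hypothetical pass check: run the project's full test suite in the
    patched checkout and treat exit code 0 as a successful repair."""
    return subprocess.run(["pytest", "-q"], cwd=repo_dir).returncode == 0

def fix_rate(fixed, total):
    """Fix Rate in the paper's reporting style: floored percentage plus counts."""
    return f"{100 * fixed // total}% ({fixed}/{total})"

print(fix_rate(250, 314))  # -> 79% (250/314)
```

For example, `fix_rate(207, 314)` gives `65% (207/314)`, matching the Bug-Knowledge-only row for Llama 3.3 in the results table.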

Results:

| Layer | Llama 3.3 Fix Rate | GPT‑4o‑mini Fix Rate |
|---|---|---|
| Bug Knowledge only | 65% (207/314) | 58% (182/314) |
| + Repository Knowledge | 74% (235/314) | 68% (214/314) |
| + Project Knowledge | 79% (250/314) | 73% (229/314) |

Key observations:

  • The Bug Knowledge layer alone already outperforms the prior state‑of‑the‑art (56 %).
  • Adding Repository Knowledge yields a 9 % absolute improvement overall, with gains observed across all bug types, confirming that cross‑file and dependency information is broadly essential.
  • Project Knowledge contributes an extra 5 % overall but only benefits Program Anomaly, GUI, and Network bugs, indicating that high‑level design or historical fix patterns matter primarily for these complex categories.
  • The “all‑at‑once” baseline underperforms (73 % for Llama 3.3, 66 % for GPT‑4o‑mini), demonstrating that indiscriminate context can introduce noise and degrade LLM focus.

Error Analysis:
The 64 bugs still unsolved after all three layers share common traits:

  • Structural isolation – the buggy code resides in a module with few explicit dependencies, making static context insufficient.
  • Dynamic runtime behavior – bugs involve UI event handling, asynchronous network calls, or configuration that only manifests at execution time.
  • User‑facing semantics – GUI bugs require understanding of visual layout expectations, which are not captured by code alone.

These findings suggest that current LLMs, even with extensive static context, struggle with reasoning that requires runtime state or user‑experience considerations.

Contributions:

  1. A hierarchical knowledge injection pipeline that is model‑agnostic and token‑efficient.
  2. Empirical evidence that repository‑level context is universally beneficial, while project‑level context is selectively advantageous.
  3. A comprehensive bug‑type‑specific analysis, revealing which layers matter for which categories.
  4. An open‑source replication package (annotations, prompt templates, extraction scripts) to foster reproducibility.
  5. A discussion of remaining limitations and a roadmap toward interactive, adaptive APR systems that can request additional information or execute code to gather dynamic evidence.

Future Directions:

  • Integrate runtime instrumentation (e.g., tracing, profiling) to feed execution traces into the LLM on demand.
  • Develop interactive loops where the model can ask clarifying questions to a developer or to a static analyzer, reducing ambiguity.
  • Explore meta‑learning approaches that predict the most useful knowledge layer for a given bug based on its metadata, improving efficiency and reducing token usage.
  • Extend the framework to other languages and larger codebases, evaluating scalability with ultra‑long context windows (e.g., 128 k tokens) now available in newer LLMs.

Conclusion:
By structuring and injecting knowledge hierarchically—starting from immediate bug facts, then expanding to repository dependencies, and finally to project‑wide documentation—the authors achieve a 23 % absolute improvement over prior LLM‑based APR methods, reaching a 79 % fix rate on a realistic Python benchmark. The work underscores that effective program repair with LLMs is not merely a matter of larger models but of providing the right contextual knowledge at the right granularity. It paves the way for next‑generation APR tools that blend static knowledge injection with dynamic, interactive reasoning to tackle the most stubborn, real‑world bugs.

