Pruning Minimal Reasoning Graphs for Efficient Retrieval-Augmented Generation

Notice: This research summary and analysis were automatically generated using AI. For absolute accuracy, please refer to the original arXiv source.

Retrieval-augmented generation (RAG) is now standard for knowledge-intensive LLM tasks, but most systems still treat every query as fresh, repeatedly re-retrieving long passages and re-reasoning from scratch, which inflates token counts, latency, and cost. We present AutoPrunedRetriever, a graph-style RAG system that persists the minimal reasoning subgraph built for earlier questions and incrementally extends it for later ones. AutoPrunedRetriever stores entities and relations in a compact, ID-indexed codebook and represents questions, facts, and answers as edge sequences, enabling retrieval and prompting over symbolic structure instead of raw text. To keep the graph compact, we apply a two-layer consolidation policy (fast ANN/KNN alias detection plus selective $k$-means once a memory threshold is reached) and prune low-value structure, while prompts retain only overlap representatives and genuinely new evidence. We instantiate two front ends: AutoPrunedRetriever-REBEL, which uses REBEL as a triplet parser, and AutoPrunedRetriever-llm, which swaps in an LLM extractor. On GraphRAG-Benchmark (Medical and Novel), both variants achieve state-of-the-art complex reasoning accuracy, improving over HippoRAG2 by roughly 9–11 points, and remain competitive on contextual summarization and generation. On our harder STEM and TV benchmarks, AutoPrunedRetriever again ranks first while using up to two orders of magnitude fewer tokens than graph-heavy baselines, making it a practical substrate for long-running sessions, evolving corpora, and multi-agent pipelines.


💡 Research Summary

Retrieval‑augmented generation (RAG) has become the de‑facto approach for grounding large language models (LLMs) in external knowledge, yet most deployments treat every user query as an isolated request. This naïve strategy forces the system to re‑retrieve long passages and re‑reason from scratch for each question, inflating token counts, latency, and monetary cost—especially problematic in long‑running sessions, multi‑agent pipelines, or when the underlying corpus evolves over time.
The paper introduces AutoPrunedRetriever, a graph‑style RAG architecture that persistently stores the minimal reasoning subgraph built for earlier questions and incrementally extends it for later ones. The system operates on a symbol‑first pipeline: free‑form text is parsed into entity‑relation‑entity (E‑R‑E) triples using either the REBEL extractor or an LLM‑based extractor, and each unique entity, relation, and triple is assigned a compact ID in a meta‑codebook (E, R, M). Questions, answers, and factual evidence are represented as sequences of edge IDs rather than raw passages.
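The codebook mechanics can be sketched in a few lines. The class and field names below are illustrative assumptions, not the authors' implementation; the point is only that entities, relations, and triples are interned once and referenced by compact IDs thereafter.

```python
# Minimal sketch of a meta-codebook (E, R, M) as summarized above.
# Class and attribute names are illustrative, not from the paper.

class Codebook:
    def __init__(self):
        self.E = {}  # entity surface form -> entity ID
        self.R = {}  # relation surface form -> relation ID
        self.M = {}  # (eid, rid, eid) edge -> triple ID

    def _intern(self, table, key):
        # Assign the next compact ID on first sight, reuse it afterwards.
        if key not in table:
            table[key] = len(table)
        return table[key]

    def add_triple(self, head, relation, tail):
        edge = (self._intern(self.E, head),
                self._intern(self.R, relation),
                self._intern(self.E, tail))
        return self._intern(self.M, edge), edge


cb = Codebook()
tid, edge = cb.add_triple("aspirin", "treats", "headache")
tid2, _ = cb.add_triple("aspirin", "treats", "headache")  # duplicate reuses the same ID
```

Because duplicate triples collapse to one ID, later questions that touch the same evidence can reference it symbolically instead of re-sending the text.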
Three design principles guide the architecture:

  1. Local, Incremental Structure (P1) – Instead of inserting every triple into a monolithic global graph, the system builds small, coherent runs (local subgraphs) on the fly. A run is extended only when a new triple scores high on semantic cohesion (cosine similarity of embeddings) and structural continuity (continuing a path). When the score falls below a threshold, the current run is closed, linearized, and stored; a new run then begins. This avoids costly global alias resolution while preserving global consistency through shared IDs.

  2. Path‑Centric Retrieval (P2) – Reasoning is realized by short edge sequences (paths) rather than broad neighborhood expansions. Retrieval proceeds in two stages. The coarse stage works purely in symbol space: for each run, entity and relation embeddings are compared to the query embeddings, and the top‑k runs are kept as a high‑recall shortlist. The fine stage re‑embeds the actual triples of those shortlisted runs and applies a composite scoring function that accounts for relational strength, coverage of query triples, many‑to‑many overlap, a greedy 1‑to‑1 alignment term, and a whole‑chunk bonus gated by full‑text similarity. This two‑layer scheme reduces the computational complexity from O(N) (scanning the whole graph) to O(k) while preserving precision.

  3. Exact Symbolic Reuse (P3) – When multiple queries share overlapping evidence, the system reuses the symbolic IDs instead of re‑sending the same text. For each retrieval channel (answers, facts, prior questions) a selector chooses among three actions: include‑all, unique (keep a single representative per semantic cluster), or exclude. The unique option clusters runs in embedding space and selects a consensus representative, dramatically cutting token duplication. A lightweight Direct Preference Optimization (DPO) policy, trained on a utility function that balances accuracy, faithfulness, token count, and latency, automatically picks the best selector configuration per query, allowing the system to adapt to different budget constraints or ambiguity levels.
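The run-extension test in P1 can be sketched as a scoring rule that mixes semantic cohesion with structural continuity. The weights, threshold, and toy embeddings below are illustrative assumptions, not the paper's values:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u)) or 1.0
    nv = math.sqrt(sum(b * b for b in v)) or 1.0
    return dot / (nu * nv)

def extend_or_close(run, triple, emb, tau=0.6, w_sem=0.7, w_struct=0.3):
    """Decide whether `triple` continues the current run (True) or the
    run should be closed (False). `run` is a list of (head, rel, tail)
    triples; `emb` maps triples to embedding vectors. The weights and
    threshold tau are assumed values for illustration."""
    if not run:
        return True  # an empty run accepts any first triple
    # Semantic cohesion: similarity to the centroid of the run's triples.
    centroid = [sum(vals) / len(run) for vals in zip(*(emb[t] for t in run))]
    semantic = cosine(emb[triple], centroid)
    # Structural continuity: the new head extends the last edge's tail.
    structural = 1.0 if triple[0] == run[-1][2] else 0.0
    return w_sem * semantic + w_struct * structural >= tau
```

When the combined score falls below `tau`, the caller would linearize and store the run, then start a new one, exactly as P1 describes.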
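The coarse-to-fine retrieval of P2 might look like the following sketch, where a plain cosine similarity stands in for the paper's composite scoring function and a single summary vector per run stands in for its entity/relation embeddings:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u)) or 1.0
    nv = math.sqrt(sum(b * b for b in v)) or 1.0
    return dot / (nu * nv)

def coarse_to_fine(query_vec, runs, triple_vecs, k=2):
    """Two-stage retrieval sketch. `runs` maps run_id -> (summary_vec,
    [triple ids]). Coarse stage: rank runs by their summary embedding and
    keep the top-k shortlist. Fine stage: rescore only the triples of
    shortlisted runs, so expensive scoring touches O(k) runs, not all N."""
    shortlist = sorted(runs, key=lambda r: -cosine(query_vec, runs[r][0]))[:k]
    scored = [(cosine(query_vec, triple_vecs[t]), t)
              for r in shortlist for t in runs[r][1]]
    return [t for _, t in sorted(scored, reverse=True)]
```

The real system's fine stage adds coverage, many-to-many overlap, alignment, and whole-chunk terms; this sketch keeps only the two-stage shape.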
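The per-channel selector of P3 reduces to choosing among three actions. The greedy clustering below is a stand-in for the paper's embedding-space clustering, and the similarity threshold is an assumed value:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u)) or 1.0
    nv = math.sqrt(sum(b * b for b in v)) or 1.0
    return dot / (nu * nv)

def apply_selector(items, action, emb=None, sim_threshold=0.9):
    """Apply one of the three channel actions from the summary above:
    'exclude', 'include_all', or 'unique'. For 'unique', a greedy pass
    keeps one representative per cluster of near-duplicate items."""
    if action == "exclude":
        return []
    if action == "include_all":
        return list(items)
    # action == "unique": keep the first item of each semantic cluster.
    reps = []
    for item in items:
        if all(cosine(emb[item], emb[r]) < sim_threshold for r in reps):
            reps.append(item)
    return reps
```

In the full system a DPO-trained policy would pick the action per channel (answers, facts, prior questions) based on the query; here the action is simply passed in.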

Consolidation and Pruning are performed continuously to keep the persistent graph compact. A first layer maintains an Approximate Nearest Neighbor (ANN) graph over entity embeddings; pairs with cosine similarity above a conservative threshold are provisionally grouped as aliases. When the total number of entities exceeds a memory budget, a second layer triggers a k‑means clustering, computes medoid representatives, and remaps all triples to these medoids, eliminating duplicate edges. The authors prove (Lemmas 9‑11) that this process can only reduce the number of edges and sequences without increasing the number of raw text encodings.
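The key invariant of consolidation (Lemmas 9–11), that alias remapping and deduplication can only shrink the edge set, is easy to see in a sketch. The alias map here stands in for the output of the ANN and k-means layers described above:

```python
def consolidate(triples, alias_of):
    """Remap entity mentions to their canonical representative and
    deduplicate the resulting edges. Since remapping can only merge
    edges, never split them, the output is never larger than the input,
    mirroring the paper's Lemmas 9-11."""
    seen, kept = set(), []
    for h, r, t in triples:
        edge = (alias_of.get(h, h), r, alias_of.get(t, t))
        if edge not in seen:
            seen.add(edge)
            kept.append(edge)
    return kept
```

In the described system, `alias_of` would map each entity to its cluster medoid once the memory budget triggers k-means; before that, only high-confidence ANN alias pairs are merged.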

Prompt Construction leverages the compact codebook. Two encoding formats are offered: (1) “word triples” that list the textual (head, relation, tail) triples directly for low‑redundancy scenarios, and (2) “compact indices” that replace each entity and relation with its short ID. The final prompt payload consists of the selected entity set, relation set, and the ID‑based sequences for the query, answer, and supporting facts, plus a brief textual header explaining the ID format. Token cost scales with |E′| + |R′| + |q| + |a| + |f|, which is typically orders of magnitude smaller than concatenating raw passages.
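A minimal version of the "compact indices" encoding could look like the following; the ID formats (`E0`, `R0`) and legend layout are assumptions for illustration, not the paper's exact payload:

```python
def compact_prompt(triples):
    """Encode (head, relation, tail) triples in a compact-index format:
    a legend mapping short IDs to surface forms, followed by ID-based
    edges. Each surface form is sent once, however often it recurs."""
    E, R = {}, {}
    def eid(x): return E.setdefault(x, f"E{len(E)}")
    def rid(x): return R.setdefault(x, f"R{len(R)}")
    edges = [f"({eid(h)},{rid(r)},{eid(t)})" for h, r, t in triples]
    legend = "; ".join(f"{v}={k}" for k, v in {**E, **R}.items())
    return legend + " | " + " ".join(edges)
```

With repeated entities, the payload grows with the number of unique symbols plus one short tuple per edge, which is the |E′| + |R′| + |q| + |a| + |f| scaling described above.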

Experiments

The authors evaluate on three fronts:

  • Complex Reasoning Accuracy (RQ1) – Using the GraphRAG‑Benchmark (Medical and Novel domains) and two newly curated STEM and TV datasets, both AutoPrunedRetriever‑REBEL and AutoPrunedRetriever‑LLM achieve state‑of‑the‑art performance. They improve over the strong baseline HippoRAG2 by roughly 9–11 percentage points on complex multi‑hop reasoning tasks, and rank first on the harder STEM and TV benchmarks.

  • Efficiency (RQ2) – Token usage drops dramatically: up to two orders of magnitude fewer tokens than graph‑heavy baselines (e.g., from ~2,000 tokens per query down to ~20–30 tokens). Latency is reduced by an average of 30 % thanks to the coarse‑to‑fine retrieval and the compact prompt format. Memory footprint remains modest, growing only linearly with the number of unique entities, which is kept in check by the two‑layer consolidation.

  • Overall Benchmark Performance (RQ3) – On the full GraphRAG benchmark, AutoPrunedRetriever maintains competitive scores on contextual summarization and generation tasks, demonstrating that the aggressive pruning does not sacrifice general language generation quality.

Significance and Future Directions

AutoPrunedRetriever reframes RAG from “retrieve everything that might be relevant” to “identify, cache, and reuse the minimal reasoning structure”. By persisting only the essential subgraph, treating edge sequences as the primary retrieval unit, and employing a symbolic ID‑based prompt, the system achieves a rare combination of high accuracy and extreme efficiency. This makes it especially suitable for long‑running conversational agents, multi‑agent pipelines (planner‑researcher‑verifier loops), and environments where the knowledge base evolves over time.

Future work could explore (1) richer logical operators (temporal constraints, conditional reasoning), (2) multimodal extensions (incorporating tables, figures, or images into the symbolic graph), and (3) dynamic updating of entity and relation embeddings to reflect newly ingested data without full re‑training. Overall, the paper provides a compelling blueprint for next‑generation, cost‑effective retrieval‑augmented generation systems.

