SpIDER: Spatially Informed Dense Embedding Retrieval for Software Issue Localization
Retrieving the code functions, classes, or files from a large codebase that are relevant to solving a given user query, bug report, or feature request is a fundamental challenge for Large Language Model (LLM)-based coding agents. Agentic approaches typically employ sparse retrieval methods such as BM25 or dense embedding strategies to identify semantically relevant units. While embedding-based approaches can outperform BM25 by large margins, they often fail to account for the underlying graph structure of the codebase. To address this, we propose SpIDER (Spatially Informed Dense Embedding Retrieval), an enhanced dense retrieval approach that integrates LLM-based reasoning with auxiliary information obtained from graph-based exploration of the codebase. We further introduce SpIDER-Bench, a graph-structured evaluation benchmark curated from SWE-PolyBench, SWEBench-Verified, and Multi-SWEBench, spanning codebases in Python, Java, JavaScript, and TypeScript. Empirical results show that SpIDER consistently improves dense retrieval performance by at least 13% across programming languages and benchmarks in SpIDER-Bench.
💡 Research Summary
The paper tackles the problem of software issue localization—identifying the exact code units (functions, classes, or files) that need to be edited to resolve a user‑reported bug or feature request—in large, multi‑language codebases. Traditional approaches fall into two camps: sparse retrieval (e.g., BM25) that relies on lexical matching, and dense retrieval that learns a bimodal encoder to embed issues and code snippets into a shared vector space. While dense methods dramatically improve semantic matching, they ignore the inherent graph structure of a repository (containment, call, import, inheritance relationships). The authors observe that buggy or relevant functions are often spatially proximate in the repository graph, and that purely semantic ranking can miss “near‑miss” candidates that are structurally close but have slightly lower embedding similarity.
To address this gap, the authors propose SpIDER (Spatially Informed Dense Embedding Retrieval), a simple yet effective augmentation of dense retrieval with graph‑aware exploration and LLM‑based re‑ranking. The workflow consists of four steps:
- Semantic Retrieval – A pretrained bimodal encoder F maps the issue description Q and every function v into a shared embedding space. Cosine similarity scores rank all functions, and the top‑K form the initial candidate set S_K(Q).
- Seed Selection – From S_K(Q), the top‑C (C ≤ K) functions are chosen as "seed centers" C_Q. These seeds are expected to be highly relevant and serve as anchors for graph traversal.
- Neighborhood Exploration – For each seed, a breadth‑first search is performed on the code graph G = (V, E) along the 'contains' edges (hierarchical parent‑child links). All functions within d hops (typically 2–4) of any seed are collected into Γ_d(C_Q). This step brings in functions that are structurally close to the seeds even when their semantic similarity is lower.
- LLM Filtering & Re‑ranking – The union of the original top‑K and the newly discovered neighbors is passed to a large language model (e.g., GPT‑4) with a prompt asking whether each function is likely relevant to the issue. The LLM's judgments promote under‑ranked but structurally proximate functions and demote irrelevant ones. An equal number of the lowest‑scoring original candidates are then removed so that the final list still contains K functions.
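The four steps above can be sketched as a single Python function. This is a minimal illustration, not the authors' implementation: the embeddings and adjacency dictionary are toy stand-ins, the adjacency is an undirected view of the 'contains' edges so the BFS can reach siblings through their parent, and `llm_filter` is a hypothetical callable standing in for the LLM relevance prompt.

```python
import math
from collections import deque

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def spider_retrieve(query_emb, func_embs, adj, K=5, C=2, d=2, llm_filter=None):
    """Sketch of the SpIDER pipeline.

    query_emb  : embedding of the issue description Q.
    func_embs  : dict {function_id: embedding vector}.
    adj        : dict {node_id: [neighbor_ids]} -- undirected view of the
                 'contains' edges (files/classes link to the functions they hold).
    llm_filter : callable(function_id) -> bool, a stand-in for the LLM judge.
    Returns a ranked list of K function ids.
    """
    # Step 1: semantic retrieval -- rank all functions by cosine similarity.
    scored = sorted(func_embs, key=lambda f: cosine(query_emb, func_embs[f]),
                    reverse=True)
    top_k = scored[:K]

    # Step 2: seed selection -- the top-C candidates anchor the traversal.
    seeds = top_k[:C]

    # Step 3: neighborhood exploration -- BFS up to d hops from any seed.
    neighbors, visited = set(), set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth >= d:
            continue
        for nxt in adj.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                if nxt in func_embs:  # only function nodes become candidates
                    neighbors.add(nxt)
                frontier.append((nxt, depth + 1))

    # Step 4: LLM filtering -- promote relevant neighbors, keep list size K.
    if llm_filter is not None:
        promoted = [f for f in neighbors if f not in top_k and llm_filter(f)]
        kept = top_k[: max(0, K - len(promoted))]
        return kept + promoted[: K - len(kept)]
    return top_k
```

With a file `F` containing functions `a`, `b`, `c`, a function like `b` that is semantically distant from the query but two hops from the top seed can be promoted into the final list by the filter, which is exactly the "near-miss" case the paper targets.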
The method is deliberately lightweight: graph traversal is cheap because the repository graph is sparse, and the LLM is invoked only on a bounded set of candidates (K + |Γ|). Moreover, the approach is language‑agnostic; the only language‑specific step is graph construction, which the authors implement using Python’s ast module for Python and Tree‑sitter for Java, JavaScript, and TypeScript.
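For Python, containment edges of the kind described can be extracted with the standard `ast` module. The sketch below is a simplified illustration of that idea only: the `module`-prefixed node naming is hypothetical, and the paper's actual graph builder (and its Tree-sitter counterparts for the other languages) may record nodes and edges differently.

```python
import ast

def build_contains_edges(source, module_name="module"):
    """Extract (parent, child) 'contains' edges from Python source.

    A module contains its top-level classes and functions; a class
    contains its methods and nested definitions.
    """
    tree = ast.parse(source)
    edges = []

    def visit(node, parent):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef,
                                  ast.ClassDef)):
                name = f"{parent}.{child.name}"
                edges.append((parent, name))
                visit(child, name)   # recurse for nested defs/methods
            else:
                visit(child, parent)

    visit(tree, module_name)
    return edges
```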
To evaluate SpIDER, the authors introduce SpIDER‑Bench, a new benchmark that aggregates issues and ground‑truth edits from three existing corpora—SWE‑PolyBench, Multi‑SWEBench, and SWEBench‑Verified—covering four programming languages. The benchmark provides a heterogeneous graph for each repository, with node features (source code) and edges (contains, invokes, imports, inherits). Statistics show that function‑level edits dominate (≈ 67 % of instances), making function‑level retrieval the most challenging yet most impactful task.
Experiments compare SpIDER against (a) vanilla dense retrieval models fine‑tuned on SWEBench (Fehr 2025, Reddy 2025) and (b) the classic BM25 baseline. Metrics include Recall@K, Mean Reciprocal Rank (MRR), and Mean Average Precision (MAP). Across all languages and datasets, SpIDER consistently outperforms the baselines, achieving an average 13 % improvement in Recall@K. The gains are especially pronounced for multi‑edit issues where relevant functions are clustered in the graph. Ablation studies reveal that each component contributes meaningfully: removing the LLM filter drops performance sharply, reducing the number of seeds C or the hop depth d diminishes the benefit of structural exploration, and using only graph proximity without semantic scores leads to noisy results.
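The three reported metrics have standard information-retrieval definitions, which can be stated compactly in Python (shown for a single issue; benchmark-level scores average these over all instances):

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of ground-truth functions found in the top-k results."""
    if not relevant:
        return 0.0
    return sum(1 for f in ranked[:k] if f in relevant) / len(relevant)

def mrr(ranked, relevant):
    """Reciprocal rank of the first relevant function (0 if none retrieved)."""
    for i, f in enumerate(ranked, 1):
        if f in relevant:
            return 1.0 / i
    return 0.0

def average_precision(ranked, relevant):
    """Mean of precision values at each rank where a relevant hit occurs."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for i, f in enumerate(ranked, 1):
        if f in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant)
```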
The paper also discusses limitations. Graph construction currently discards files written in secondary languages, potentially missing cross‑language dependencies. The current implementation only exploits ‘contains’ edges; richer relations (invokes, imports, inherits) could further improve recall. Finally, reliance on an external LLM introduces latency and cost considerations for real‑time systems.
Future work suggested includes (i) integrating multiple edge types with learned weights, (ii) dynamic seed selection based on confidence scores, and (iii) caching or distilling LLM judgments to reduce inference overhead.
In summary, SpIDER demonstrates that combining dense semantic embeddings with lightweight graph‑based spatial reasoning and LLM re‑ranking yields a robust, multilingual code retrieval system. It bridges the gap between pure semantic similarity and repository structure, offering a practical solution for next‑generation LLM‑powered coding agents that need accurate, fine‑grained code localization before generating patches or new features.