Locating the files and functions that require modification in large software repositories is challenging due to their scale and structural complexity. Existing LLM-based methods typically treat this as a repository-level retrieval task and rely on multiple auxiliary tools, which often overlook code execution logic and complicate model control. We propose RepoNavigator, an LLM agent equipped with a single execution-aware tool: jumping to the definition of an invoked symbol. This unified design reflects the actual flow of code execution while simplifying tool manipulation. RepoNavigator is trained end-to-end via Reinforcement Learning (RL) directly from a pretrained base model, without relying on closed-source distillation. Experiments demonstrate that the RL-trained RepoNavigator achieves state-of-the-art performance, with the 7B model outperforming 14B baselines, the 14B model surpassing 32B competitors, and the 32B model exceeding closed-source models such as GPT-5 on most metrics. These results confirm that integrating a single, structurally grounded tool with RL training provides an efficient and scalable solution for repository-level issue localization.
With the rapid advancement of Large Language Models (LLMs) (Liu et al., 2024a; Team, 2024; Yang et al., 2025a), equipping LLMs with pre-built tools to form LLM agents has become a common paradigm for expanding their capabilities (Shen, 2024; Yuan et al., 2024; Lu et al., 2024). In the domain of software engineering (SWE), although LLM agents can effectively handle simple programming tasks (Hui et al., 2024; Guo et al., 2024a), their ability to operate on large-scale software repositories remains limited. SWE-BENCH (Jimenez et al., 2023) currently serves as the most comprehensive benchmark for evaluating whether LLMs can resolve real-world GitHub issues. No pretrained LLM can process an entire repository directly due to context limits (Yang et al., 2024b). While SWE-AGENT (Jimenez et al., 2023) provides moderate gains, it remains far from enabling robust repository-level reasoning.
Figure 1: Illustration of an LLM navigating through a code repository. The LLM is equipped with a single yet powerful tool: jump, which is realized through a language server.
Most existing agents rely on test-time scaling applied directly to pretrained LLMs (Liu et al., 2023;Chen et al., 2025;Schmidgall et al., 2025). In SWE tasks, tools are essential rather than optional: real-world repositories are far larger than the context window of current LLMs, making it impossible to process an entire codebase in a single forward pass. Agents must therefore iteratively invoke tools to retrieve partial information from the repository and interleave natural-language reasoning with tool calls.
Mainstream LLM agents (Chen et al., 2025; Liu et al., 2025; Xiang et al., 2025; Wang et al., 2023; Chen et al., 2024) are rarely exposed to such agentic interaction patterns during pretraining and typically acquire tool usage only through few-shot prompting, which is insufficient for learning complex multi-step tool-chaining behaviors. Moreover, because tool definition spaces are effectively unbounded, pretrained models cannot fully internalize their semantics without post-training. To mitigate these issues, post-training paradigms such as Supervised Finetuning (SFT) (Ma et al., 2025) and Reinforcement Learning with Verifiable Rewards (RLVR) (Yu et al., 2025a; Yue et al., 2025) have been applied, with promising results in domains including retrieval agents (Jin et al., 2025a), GUI agents (Hong et al., 2024), and math agents (Yan et al., 2025).
Directly training an agent to fix software issues, however, remains difficult. A single bug often admits multiple valid patches, making string-level evaluation unreliable. The only precise evaluation method requires executing candidate patches inside a dedicated Docker environment for each repository (Luo et al., 2025), which is prohibitively expensive in compute for supervised training. To make training more tractable, we adopt a simplified yet broadly generalizable task: issue localization. Prior work shows that a software issue becomes substantially easier to resolve once the relevant functions and files are correctly identified (Chen et al., 2025; Ma et al., 2025; Xia et al., 2024; Jiang et al., 2025). Since modern software repositories contain far more code than any LLM's context window, localization drastically reduces the search space and improves downstream solvability. Crucially, localization outputs a discrete set of paths, enabling verifiable, string-level evaluation that is compatible with scalable training frameworks such as SFT and RLVR, since string-level evaluation requires minimal resources.
Existing localization agents (Ma et al., 2025; Chen et al., 2025; He et al., 2025) typically rely on multiple tools, including SearchClass, SearchMethods, and GetImports. Although effective to some extent, these tools operate on high-level abstractions of programming languages (classes, functions, etc.) that do not reflect how code actually executes. High-level abstractions, such as classes or inheritance, disappear after compilation, leaving only sequential execution and jump operations. Since modern LLMs already excel at modeling sequential dependencies, we focus on enhancing their ability to jump across the repository, that is, to follow and inspect the source definition of symbols as they appear in execution. To this end, we introduce a single, structurally grounded tool: jump, which retrieves the precise definition of a given symbol. Our agent invokes this single tool to navigate the repository, as depicted in Fig. 1. Details of this tool are provided in Sec. 3.3.
Our main contributions are threefold: (1) We propose the first repo-level localization agent trained with reinforcement learning directly from a pretrained model, without distillation from a closed-source model. (2) We design a repository-navigation agent that operates by performing realistic jump operations aligned with actual execution semantics. (3) We demonstrate that one unified tool significantly improves efficiency and controllability compared to multi-tool pipelines.
LLM agents are a promising way to equip models with complex tools while reasoning (Li et al., 2024; Huang et al., 2024; Guo et al., 2024b). However, because most pretrained LLMs are trained on text only and developers can define arbitrary tools, most tools are out-of-domain (OOD) for LLMs. Even the most powerful models often fail when calling newly defined tools due to incorrect calling formats or failed parameter parsing. Thus, training an LLM to master newly defined tools is critical for LLM agents. Intuitively, tool-calling trajectories can be generated by a more powerful LLM and used to train a student model via supervised finetuning (SFT) (Chen et al., 2025). However, this pipeline requires a stronger teacher model that has already mastered the tool. Recently, methods that require no teacher model have emerged. Rejection-sampling finetuning (RFT) (Ahn et al., 2024) utilizes trajectories generated by the agent itself via multiple rollouts. Agentic RL (Jin et al., 2025a) is an on-policy RLVR method requiring only the outcome to verify trajectories. Such training methods yield remarkable results when the tools are search engines (Jin et al., 2025a), Python executors (Jimenez et al., 2023), calculators (Yan et al., 2025), and visual models (Gupta & Kembhavi, 2023).
The introduction of SWE-bench (Jimenez et al., 2023; Yang et al., 2024b) has motivated a range of agentic pipelines for software engineering tasks. Among them, SWE-AGENT (Yang et al., 2024a) and OPENHANDS (Wang et al., 2025a) are widely adopted frameworks that equip agents with tools for interacting with computing environments (e.g., a Linux terminal). Workflow-based methods such as Agentless (Xia et al., 2024) decompose issue resolution into localization, repair, and validation subproblems. PatchPilot (Li et al., 2025) adds reproduction and refinement stages to the workflow. CORTEXA (Sohrabizadeh et al.) projects the repository into an embedding space to enable efficient retrieval. Chen et al. (2025) build the repository as a graph and apply graph-level search tools for localization, and Wang et al. (2025a) further integrate commit history as agent memory. RepoLens (Wang et al., 2025b) equips the agent with conceptual information about the repository to enable repo-level understanding. These pipelines are training-free and yield competitive results.
To enable task-specific training, DEEPSWE (Luo et al., 2025) and SWE-SWISS (He et al., 2025) employ reinforcement learning and achieve strong performance. SWE-Gym (Pan et al., 2024) builds a virtual environment to train an agent for this task. However, end-to-end training remains costly because patch evaluation requires executing Docker containers across numerous repositories, and Docker containers hinder guided-search methods due to their non-serializability (Zainullina et al., 2025). Consequently, issue localization has emerged as a computationally efficient alternative, aiming to identify faulty components (at the file or function level) rather than generating full patches.
Figure 2: Overview of our RepoNavigator. During the rollout phase, the agent can invoke the jump tool, and the language server will return the definition code of the symbol. This process is trained by reinforcement learning.
Recent localization agents include LOCAGENT (Chen et al., 2025) and COSIL (Jiang et al., 2025), which model codebases as graphs and integrate them into LLM reasoning, and ORCALOCA (Yu et al., 2025b), which enhances efficiency through priority scheduling, action decomposition, and context pruning. From an open-source perspective, REPOSEARCHER (Ma et al., 2025), trained with distillation and RL on the Qwen model family (Team, 2024), represents a notable advancement.
Nevertheless, prior agents overlook the structural relations within repositories, where modules, classes, and functions are cross-referenced across files, and typically rely on multiple search tools for symbol definition retrieval, amplifying error propagation (see Sec. 3). In contrast, we employ a single execution-logic-focused tool, reducing usage complexity. Finally, our approach constitutes the first localization agent trained directly from pretrained models, without relying on distillation-based supervised finetuning, a crucial stage in both RepoSearcher (Ma et al., 2025) and LocAgent (Chen et al., 2025).
We present RepoNavigator, a reinforcement-learning agent for repository-level issue localization. The method consists of three components: (1) a unified tool that retrieves the definition of any symbol in a given file, (2) a reasoning-action agent loop that alternates between natural-language reasoning and tool invocation, and (3) a GRPO-based RL algorithm for optimizing long-horizon tool-augmented trajectories. Below we provide the formal problem setting and the detailed method.
Given a repository $\mathcal{R} = \{f_1, \ldots, f_N\}$ and an issue description $q$, the goal is to output the relevant code regions $Y^* = \{(f_i, g_{i,j})\}$, where $g_{i,j}$ denotes a function or code span in file $f_i$. At each step $t$, the agent produces an optional reasoning step $r_t$ and a tool call $a_t$, and receives an observation $o_t$ returned by the tool.
RepoNavigator uses a single-tool design to avoid multi-tool orchestration overhead. At each step, the policy $\pi_\theta$ decides whether to continue reasoning or to emit a JSON-formatted tool call, in which a symbol and its containing file are passed to the tool. The agent receives structured observations (code snippets or error messages) and continues reasoning until termination. The loop is reason → act → observe (Yao et al., 2022).
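As a concrete illustration, the sketch below implements a minimal reason → act → observe loop around a single jump tool. The `llm_generate` and `jump_tool` callables are hypothetical stand-ins (not part of our released code), and the tool-call and `\boxed{}` conventions follow the prompt shown in the appendix; everything else is an assumption for illustration.

```python
import json
import re

def run_agent(issue: str, entry_file: str, llm_generate, jump_tool, max_turns: int = 12):
    """Minimal reason -> act -> observe loop (illustrative sketch, not the exact implementation)."""
    messages = [{"role": "user",
                 "content": f"{issue}\n\nThe entry file of the code base is:\n{entry_file}"}]
    for _ in range(max_turns):
        reply = llm_generate(messages)          # reason: natural language, possibly with a tool call
        messages.append({"role": "assistant", "content": reply})

        call = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", reply, re.S)
        if call is None:                        # no tool call: expect the final \boxed{...} answer
            answer = re.search(r"\\boxed\{(.*?)\}", reply, re.S)
            return answer.group(1).split(",") if answer else []

        try:                                    # act: parse and execute the single jump tool
            args = json.loads(call.group(1))["arguments"]
            observation = jump_tool(symbol=args["symbol"], file_path=args["file_path"])
        except Exception as exc:                # malformed call counts as a failed tool call
            observation = f"Tool call failed: {exc}"

        messages.append({"role": "tool", "content": observation})   # observe, then reason again
    return []
```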
Language servers resolve the definition of a Python symbol through a deterministic static-analysis pipeline that approximates Python's runtime name-binding semantics. Given a symbol occurrence $s$ at source location $\ell$, Pyright computes a resolution mapping
$$\mathcal{D}(s, \ell) = \{(f_1, p_1), \ldots, (f_k, p_k)\},$$
where each pair $(f_i, p_i)$ denotes a file path and a source position corresponding to a valid definition site of $s$. In practice, we use the file path and the symbol name to resolve $\ell$. If multiple symbols with the same name exist in the same code snippet, we additionally pass an index to the tool, which allows $\ell$ to be resolved unambiguously.
In this process, the source file is parsed into an abstract syntax tree (AST). The syntactic role of $s$ (e.g., name, attribute access, or call expression) determines the subsequent resolution strategy. For attribute expressions $a.b$, Pyright treats $a$ as a receiver expression whose type must be inferred prior to member lookup.
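For intuition, the snippet below uses Python's standard `ast` module to classify occurrences of a symbol as a name, an attribute access, or a call, which is the kind of syntactic-role information the resolver needs before choosing a strategy. It is an illustration only, not Pyright's actual code; the source string is a made-up one-liner.

```python
import ast

SOURCE = "result = separability_matrix(cm.left)"

def syntactic_roles(source: str, symbol: str):
    """Return the syntactic role of each occurrence of `symbol` (illustrative)."""
    roles = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name) and node.id == symbol:
            roles.append("name")
        elif isinstance(node, ast.Attribute) and node.attr == symbol:
            roles.append("attribute")          # receiver type must be inferred before member lookup
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name) and node.func.id == symbol:
            roles.append("call")
    return roles

print(syntactic_roles(SOURCE, "separability_matrix"))  # ['call', 'name'] -- the Call wraps a Name node
print(syntactic_roles(SOURCE, "left"))                 # ['attribute']
```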
Lexical Scope Resolution For a name symbol $x$, candidate definitions are searched along a scope chain
$$\text{local} \;\rightarrow\; \text{enclosing} \;\rightarrow\; \text{global} \;\rightarrow\; \text{builtins},$$
following Python's LEGB rule. Each scope maintains a symbol table mapping identifiers to their defining AST nodes.
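A minimal sketch of this lookup, assuming each scope is represented as a plain dictionary of symbol-table entries; a real language server stores AST nodes and richer metadata rather than strings, and the example scope contents are hypothetical.

```python
def resolve_name(symbol: str, scope_chain: list[dict]) -> str | None:
    """Walk the scope chain (local -> enclosing -> global -> builtins) and
    return the first definition found, mirroring Python's LEGB rule."""
    for scope in scope_chain:                 # ordered from innermost to outermost
        if symbol in scope:
            return scope[symbol]
    return None                               # unresolved: the tool reports an error

# Hypothetical symbol tables for a nested function inside astropy/modeling/separable.py
local_scope     = {"result": "local variable"}
enclosing_scope = {"_coord_matrix": "def _coord_matrix(...): ..."}
global_scope    = {"_separable": "def _separable(transform): ...", "_operators": "dict of operators"}
builtins_scope  = {"len": "<built-in function len>"}

print(resolve_name("_separable", [local_scope, enclosing_scope, global_scope, builtins_scope]))
```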
Static Type Inference For attribute symbols, Pyright computes a (possibly union-valued) type $t$ for the receiver expression $a$ using type annotations, assignment flow analysis, function return types, and stub files (.pyi). Member resolution then returns the first class $C \in \mathrm{MRO}(t)$ whose symbol table defines the attribute, where MRO denotes the method resolution order.
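As a concrete illustration of MRO-based member lookup, the snippet below resolves an attribute to the first class along the receiver type's method resolution order that defines it. This mirrors the rule above but is not Pyright's implementation, and the class names are hypothetical.

```python
def resolve_member(receiver_type: type, member: str):
    """Return (defining_class, definition) for the first class in MRO(t) defining `member`."""
    for cls in receiver_type.__mro__:          # method resolution order, most derived first
        if member in vars(cls):                # only members defined directly on this class
            return cls, vars(cls)[member]
    raise AttributeError(member)

class Model:
    def evaluate(self):
        return "Model.evaluate"

class CompoundModel(Model):                    # hypothetical class names for illustration
    def __or__(self, other):
        return "CompoundModel.__or__"

print(resolve_member(CompoundModel, "evaluate"))   # resolved on the base class Model
print(resolve_member(CompoundModel, "__or__"))     # resolved on CompoundModel itself
```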
Import Dependency Graph For cross-file resolution, an import dependency graph that statically emulates Python's module-loading semantics is built. Import statements introduce bindings that map local symbols to exported symbols of target modules, including re-exports and `__all__`-based filtering. Resolution may therefore traverse multiple modules before reaching a concrete definition.
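The sketch below builds a simplified import dependency graph with the standard `ast` module by recording which modules each file imports. It is only an approximation of what a language server does: a real resolver would additionally follow re-exports and `__all__` filtering as described above.

```python
import ast
from pathlib import Path

def import_graph(repo_root: str) -> dict[str, set[str]]:
    """Map each Python file to the modules it imports (static approximation)."""
    graph: dict[str, set[str]] = {}
    for path in Path(repo_root).rglob("*.py"):
        deps: set[str] = set()
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue                                   # skip files that do not parse
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                deps.update(alias.name for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                deps.add(node.module)
        graph[str(path)] = deps
    return graph

# Resolving a symbol imported in one file may require traversing several
# edges of this graph before reaching the concrete definition site.
```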
Given the history $h_t = (q, o_{1:t-1}, a_{1:t-1})$, the agent samples either a natural-language reasoning step $r_t \sim \pi_\theta(\cdot \mid h_t)$ or a structured tool call $a_t \sim \pi_\theta(\cdot \mid h_t)$. Tool calls must satisfy a JSON grammar enforced via constrained decoding. The loop continues until the agent predicts its final locations $\hat{Y}$.
We apply reinforcement learning with verifiable rewards to train the agent directly from the pretrained model. Unlike Ma et al. (2025), we do not require a more powerful teacher model for distillation as a warm-up.
In practice, we apply Group Relative Policy Optimization (GRPO), whose loss function is
$$\mathcal{L}_{\mathrm{GRPO}}(\theta) = -\,\mathbb{E}_{\tau \sim \pi_{\theta_{\mathrm{old}}}}\!\left[\sum_{t}\min\!\big(\rho_t \hat{A}_t,\ \operatorname{clip}(\rho_t,\,1-\epsilon_{\mathrm{low}},\,1+\epsilon_{\mathrm{high}})\,\hat{A}_t\big)\right] + \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta \,\big\|\, \pi_{\theta_{\mathrm{old}}}\right), \qquad \rho_t = \frac{\pi_\theta(a_t \mid h_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid h_t)},$$
where the first term is the standard policy-gradient objective with an estimated advantage function $\hat{A}_t$, which promotes actions that lead to higher-than-expected returns. The second term is a Kullback-Leibler (KL) divergence penalty, scaled by a coefficient $\beta$, which acts as a trust region, preventing the updated policy $\pi_\theta$ from moving too far from the previous policy $\pi_{\theta_{\mathrm{old}}}$. This formulation ensures stable and consistent policy improvement by balancing reward maximization with behavioral consistency. The advantage $\hat{A}_t$ is computed as in standard GRPO, and Algorithm 1 presents the full RL training process of RepoNavigator.
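For concreteness, standard GRPO estimates the advantage of each trajectory by normalizing its reward against the mean and standard deviation of its rollout group; the sketch below follows that standard recipe (minor details of the estimator in our training framework may differ).

```python
import torch

def grpo_advantages(group_rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantage: normalize each trajectory's reward within its rollout group.

    group_rewards: shape (G,), one scalar reward per trajectory.
    The resulting scalar is broadcast to every token of the corresponding trajectory.
    """
    mean = group_rewards.mean()
    std = group_rewards.std()
    return (group_rewards - mean) / (std + eps)

rewards = torch.tensor([0.8, 0.2, 0.5, 0.0])   # e.g. hybrid rewards of 4 rollouts for one issue
print(grpo_advantages(rewards))                # higher-than-average rollouts get positive advantage
```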
The reward of the GRPO process combines localization quality with tool-calling reliability. Localization quality is measured by the Dice coefficient, a common metric for set-level comparison; for the predicted set $\hat{Y}$ and the ground-truth set $Y^*$,
$$\mathrm{Dice}(\hat{Y}, Y^*) = \frac{2\,|\hat{Y} \cap Y^*|}{|\hat{Y}| + |Y^*|},$$
and $S(\tau)$ is the tool-calling success rate extracted from the trajectory $\tau$. We consider a tool call to have failed when its format is incorrect, the parsed symbol does not exist, or the tool exits unexpectedly for any other reason.
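A minimal sketch of this hybrid reward, assuming the two terms are combined as a simple weighted sum; the coefficient `alpha` and the example inputs are illustrative assumptions, not the paper's exact formulation.

```python
def dice(pred: set[str], gold: set[str]) -> float:
    """Dice coefficient between predicted and ground-truth location sets."""
    if not pred and not gold:
        return 1.0
    return 2 * len(pred & gold) / (len(pred) + len(gold))

def hybrid_reward(pred: set[str], gold: set[str], tool_calls: list[bool], alpha: float = 0.9) -> float:
    """Combine localization quality (Dice) with the tool-calling success rate S(tau).

    `alpha` is an assumed weighting coefficient for illustration only.
    """
    success_rate = sum(tool_calls) / len(tool_calls) if tool_calls else 1.0
    return alpha * dice(pred, gold) + (1 - alpha) * success_rate

pred = {"astropy/modeling/separable.py::_cstack"}
gold = {"astropy/modeling/separable.py::_cstack", "astropy/modeling/separable.py::_separable"}
print(hybrid_reward(pred, gold, tool_calls=[True, True, False]))
```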
Datasets We extract valid samples from SWE-smith (Yang et al., 2025b) to form the training set. We apply Qwen2.5-7B-Instruct with RepoNavigator to sample each datum 16 times, and a sample is discarded if all 16 scores are zero. For validation, we test our method on SWE-bench Verified (Jimenez et al., 2023), a human-verified subset of SWE-bench, and additionally on the Python subset of SWE-bench Pro; both benchmarks are built on real-world GitHub issues.
Metrics Previous works (Chen et al., 2025; Ma et al., 2025) used recall and precision as metrics. However, because the predicted and ground-truth locations are sets of strings, recall or precision alone cannot fairly reflect performance. We therefore adopt sample-F1 (the average of per-sample F1 values) and IoU (intersection over union) as our core metrics. We also report recall and precision to align with previous methods, although on their own they do not reflect performance fairly. Strict definitions of all metrics are provided in Appendix B.
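Concretely, both core metrics are computed per sample over the two sets of location strings and then averaged across the dataset; the sketch below is a straightforward implementation of the standard definitions.

```python
def sample_f1(pred: set[str], gold: set[str]) -> float:
    """Per-sample F1 over location strings; averaged across samples to obtain sample-F1."""
    if not pred or not gold:
        return 0.0                                  # empty prediction counts as total failure
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def iou(pred: set[str], gold: set[str]) -> float:
    """Intersection over union of the two location sets."""
    return len(pred & gold) / len(pred | gold) if (pred or gold) else 0.0

gold = {"a.py::f", "b.py::g"}
pred = {"a.py::f", "c.py::h"}
print(sample_f1(pred, gold), iou(pred, gold))       # 0.5 0.333...
```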
Training For the 7B model, we conduct GRPO training on 8 NVIDIA Tesla-A100-80G GPUs; for the 14B and 32B models, we train on 16 NVIDIA Tesla-A100-80G GPUs.
We use verl (Shen, 2024) as the training framework and vLLM (Kwon et al., 2023) as the inference engine. We train each model for 1 epoch with a training batch size of 128 on 4k training samples filtered from SWE-smith, with the maximum prompt length and maximum response length both set to 10240. We roll out 8 times for each sample with the temperature set to 1.0 to encourage exploration, and use greedy decoding at inference time to ensure stable performance. More experimental details are provided in Appendix B.
Baselines We compare our method against LocAgent (Chen et al., 2025), CoSIL (Jiang et al., 2025), Agentless (Xia et al., 2024), OrcaLoca (Yu et al., 2025b), and RepoSearcher (Ma et al., 2025). LocAgent is distilled from Claude-3.5-Sonnet, and RepoSearcher is distilled from Claude-3.7-Sonnet. Detailed descriptions of the baseline methods are provided in Appendix A.
For the 14B and 32B models, RepoNavigator achieves state-of-the-art results among all training-free methods, which further supports its validity: the tool we implement is effective and promising, and our single-tool pipeline outperforms previous multi-tool pipelines.
Compared with RepoSearcher, which is distilled from Claude-3.7-Sonnet (Anthropic, 2025) and then reinforced with GRPO, the trained RepoNavigator outperforms it on all metrics except recall. Moreover, our training-free variant already outperforms RepoSearcher at the 14B scale. This is likely due to the simplified tool we integrate into the agent (see Sec. 5 for details).
To assess the generalizability of RepoNavigator, we report its performance on the Python samples of the SWE-bench Pro dataset (Yang et al., 2025b) in Table 2. The results on this dataset are consistent with those observed on SWE-bench Verified. While we cannot fully exclude potential data leakage for SWE-bench Verified, we can make a stronger claim for SWE-bench Pro, as it was released after the Qwen2.5 series.
We investigate the capabilities of GRPO and our proposed hybrid reward function in the context of agentic training, using SFT as the primary point of comparison. The results in Fig. 3 indicate that our method significantly surpasses both SFT and GRPO with a vanilla reward function, validating the critical role of RL and hybrid reward mechanisms in enhancing agentic capabilities. While Ma et al. (2025) emphasize the necessity of an initial SFT phase, our ablation study (detailed in Appendix C) reveals the potential of training fully autonomous agents directly through RL. This observation challenges the requirement for a "cold start" and resonates with a growing body of literature advocating RL-centric training paradigms (Liu et al., 2024a; Kirk et al., 2023; Jin et al., 2025b).
We also ablate the success rate in the reward function. As presented in Fig. 3, reinforcement learning with the hybrid reward (including the tool-calling success rate) achieves higher performance than a pure outcome reward (without it). This indicates that learning to call tools correctly is vital in agentic learning.
To assess the significance of tool-calling in RepoNavigator, we varied the maximum number of tool-calling turns and reported the results in Fig. 4. As shown in the figure, allowing more tool-calling turns consistently leads to improved performance for RepoNavigator, both before and after reinforcement learning (RL) training. These results empirically validate the scaling law of tool-calling in this context.
To evaluate the impact of different localization results on final issue resolution, we test RepoNavigator against the baselines on SWE-bench Verified: we directly apply the repair phase of Agentless while replacing its localization front-end with each method. As shown in Table 3, RepoNavigator achieves the highest issue-resolution performance, and reinforcement learning further improves it.
In this section, we analyze the rationale behind RepoNavigator: building fewer tools with more powerful, more consolidated functionality is more effective than building multiple task-specific tools. In addition, compared with the full repository scope, the access scope of the jump tool has a much higher IoU with the groundtruth set.
Let the total number of available tools be denoted as k.
When only a single tool-specifically the jump tool-is retained, the system’s structural relations become simpler, as both the action space and the observation space are restricted to what this tool can access. In this case, the set of possible actions and observable elements is smaller than when multiple tools are available. This reduction is generally beneficial, since additional tools often introduce new and unfamiliar interfaces that large language models have not been exposed to during pretraining.
For a given process in issue localization (for instance, checking the code snippet of a function), let the success probability of the $i$-th call be $p_i$. For a task that requires $k$ sequential tool calls, the overall success rate can be expressed as
$$P_{\mathrm{success}} = \prod_{i=1}^{k} p_i.$$
Since each step introduces an additional potential point of failure, the cumulative success rate typically decreases as the number of required tool calls increases. Therefore, in general, completing a task with a single, more versatile tool tends to be more reliable than relying on multiple narrowscope tools executed in sequence.
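As a numerical illustration under hypothetical per-call probabilities: if each of $k = 5$ sequential calls succeeds independently with probability $p_i = 0.95$, then
$$P_{\mathrm{success}} = \prod_{i=1}^{5} 0.95 \approx 0.774,$$
compared with $0.95$ for a single capable call, i.e., chaining five narrow tools loses roughly 18 points of end-to-end reliability in this hypothetical setting.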
The access scope of a tool is defined as the complete set of files, symbols, and other resources that the tool can access within a repository. For a jump tool that navigates to symbol definitions, the access scope can be obtained by starting from a given entry point and recursively resolving all referenced symbols until no new definitions can be reached. This access scope is clearly much smaller than the full repository scope. Consequently, when computing the Intersection over Union (IoU) between the prediction set and the groundtruth set, using the jump tool results in a higher IoU, as depicted in Fig. 5. In contrast, applying multiple repo-level retrieval tools yields an access scope equal to the whole repository.
Starting from a designated entry point, recursively applying the jump operation, which resolves the definitions of all referenced symbols, effectively constructs a transitive closure of all semantically activated symbols. Since any code location contributing to a specific issue must reside on a dependency path originating from the entry point, it is logically reachable through this recursive expansion of symbol references. Consequently, the access scope generated by an exhaustive jump traversal is guaranteed to encompass all locations requiring modification to resolve the issue, excluding only those introduced by dynamic, runtime changes to the repository (e.g., monkey-patching and string-based imports).
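A minimal sketch of computing the access scope as the transitive closure of jump from an entry point. The helpers `referenced_symbols(file)` (listing symbols used in a file) and `jump_definitions(symbol, file)` (wrapping the language server) are hypothetical stand-ins passed in as callables, not functions of our released implementation.

```python
from collections import deque

def access_scope(entry_file: str, referenced_symbols, jump_definitions) -> set[str]:
    """Transitive closure of jump: all files reachable by recursively resolving
    every referenced symbol, starting from the entry point."""
    reachable, queue = {entry_file}, deque([entry_file])
    while queue:
        current = queue.popleft()
        for symbol in referenced_symbols(current):           # symbols used in this file
            for defining_file in jump_definitions(symbol, current):
                if defining_file not in reachable:            # newly reached definition site
                    reachable.add(defining_file)
                    queue.append(defining_file)
    return reachable   # typically far smaller than the full repository scope
```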
To further verify this claim, we change the tool set of RepoNavigator and conduct RL training with only the outcome reward. We add extra tools that were frequently used in previous works (Chen et al., 2025; Ma et al., 2025; Jiang et al., 2025) and present the results in Table 4 (full results in Table 6). GetClass/GetFunc takes a class/function name as input and outputs the class/function definition; GetStruc takes no input and outputs the repository's file-system structure. The results clearly indicate that additional tools do not improve the model's performance: when one tool is capable enough, more tools do not bring more benefits. This suggests that researchers should develop fewer but more capable tools.
In this work, we introduced RepoNavigator, a repository-level issue localization agent that departs from existing multi-tool paradigms by leveraging a single, more capable jump tool for symbol resolution. This unified design faithfully reflects real code execution flow while significantly reducing the complexity and brittleness of multi-step tool chaining. Through tool-integrated GRPO, RepoNavigator learns to reason, invoke tools, and refine its predictions in a closed loop, enabling end-to-end optimization without relying on closed-source teacher models or distillation.
Extensive experiments across SWE-bench Verified and SWE-bench Pro demonstrate that RepoNavigator achieves state-of-the-art localization performance. We theoretically analyze the results, confirming that a single powerful tool, jointly optimized with reinforcement learning, can provide stronger robustness and more reliable multi-step reasoning than previous frameworks relying on multiple narrowly scoped tools. Future work will explore extending RepoNavigator from Python to more programming languages.
Benchmarks The SWE-bench (Jimenez et al., 2023) benchmark series is designed to evaluate the capability of Large Language Models (LLMs) in resolving real-world GitHub issues. In the official evaluation protocol, models are provided solely with the problem statement and the historical repository state to generate a patch. However, in practical software engineering workflows, developers typically have access to at least one entry point for any given issue. To better simulate these real-world scenarios and enhance evaluation fidelity, we deviate from the official SWE-bench standard by providing an explicit entry point for each sample. Specifically, for both our proposed method and all baseline models, we supply the precise file path and the corresponding entry function.
Tools Description Table 5 details the inputs and outputs of the tools used by RepoNavigator and the baselines (except Agentless, which is workflow-based). CoSIL, OrcaLoca, and RepoSearcher apply simple retrieval tools that overlook the structural information of the repository. Among all methods, only RepoNavigator reduces the number of tools to one.
Hyperparameters for RL We set the low clip ratio to 0.2, the high clip ratio to 0.8, the learning rate to $10^{-6}$, the training batch size to 128, the temperature to 1.0, the maximum number of tool-calling turns to 12, and the maximum response length to 10240.
For baseline methods, we use vLLM (Kwon et al., 2023) to deploy the corresponding model locally, setting the maximum response length to 32768 and enforce-eager to True. We deploy all models on 8 NVIDIA Tesla-A100-80G GPUs and follow vLLM's default settings for the other hyperparameters.
Metrics Given the set of predicted locations (either file-level or function-level) $\hat{Y}$ and the set of groundtruth locations $Y^*$, the aforementioned metrics are calculated as
$$\mathrm{Recall} = \frac{|\hat{Y} \cap Y^*|}{|Y^*|}, \quad \mathrm{Precision} = \frac{|\hat{Y} \cap Y^*|}{|\hat{Y}|}, \quad \mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \quad \mathrm{IoU} = \frac{|\hat{Y} \cap Y^*|}{|\hat{Y} \cup Y^*|},$$
and sample-F1 is the average of per-sample F1 values.
In practice, when the prediction set $\hat{Y}$ is empty (e.g., upon total failure), we set recall, precision, sample-F1, and IoU to zero. We use the function-level localization results of the different methods and apply the patch-generation backend of Agentless (Xia et al., 2024) to generate patches. Resolved (%) denotes the percentage of samples that pass all unit tests after applying the patch.
Implementation When a response exceeds the maximum length, we clip it, force the agent to stop, and assign a score of zero. When the agent exceeds the maximum number of tool-calling turns (12), we append "You must not call tools anymore, and you must give the final answer" to the tool's response. Most of the time, the agent then stops calling tools and generates its final answer; if not, we force it to stop and assign a score of zero. Note that when the maximum number of tool-calling turns has not been reached and the final answer is generated, the agent loop stops automatically. This constitutes an automatic agentic framework that allows the agent to explore the environment with few constraints.
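The control flow around the tool-call budget can be sketched as below; this is a simplification for illustration (the constants match the values quoted above, but the function and return values are hypothetical).

```python
MAX_TOOL_CALLS = 12
STOP_NOTICE = "You must not call tools anymore, and you must give the final answer"

def handle_turn(turn_index: int, is_tool_call: bool, response_len: int, max_len: int = 10240) -> str:
    """Decide how the agent loop continues after one assistant turn (illustrative)."""
    if response_len > max_len:
        return "force_stop_zero_reward"        # clipped response: stop and score 0
    if not is_tool_call:
        return "finish"                        # final answer produced: loop ends automatically
    if turn_index + 1 >= MAX_TOOL_CALLS:
        return STOP_NOTICE                     # appended to the tool response on the last allowed turn
    return "execute_tool"
```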
Preventing Data Leakage It is a widespread concern that data leakage during pretraining threatens the validity of post-training methods. Nevertheless, the results in Table 2 rule out this concern: the SWE-bench Pro dataset was published in 2025, while the Qwen2.5 series was released in 2024. Moreover, we exclude training samples whose repository also appears in SWE-bench Verified or SWE-bench Pro.
Language Server In practice, we use a Python language server to extract the definition code corresponding to an invoked symbol within a repository. However, the presence of monkey patches (runtime modifications to the repository) and dynamic imports can degrade the language server's performance, since its functionality relies on static analysis techniques such as abstract syntax trees and symbol tables. In such cases, the tool returns an error message indicating that the definition of the current symbol cannot be located. Nevertheless, in our empirical evaluation, we did not observe any instances of monkey patching or dynamic imports in the analyzed datasets.
Ablation Study We investigate the impact of the hybrid reward mechanism in Fig. 3 and explore the influence of SFT as a "cold start" phase for the Reinforcement Learning (RL) process. As illustrated in Table 7, our findings suggest that while SFT enhances initial model performance, it ultimately constrains the long-term potential of RL: by forcing the model to adhere to pre-generated trajectories, SFT improves basic task proficiency at the expense of autonomous exploration. Consequently, the results in Table 7 demonstrate that training the model directly with reinforcement learning yields the best performance.
Cost Analysis We conduct a comparative analysis of token usage across methods, utilizing Qwen2.5-14B-Instruct as the base model on SWE-bench Pro. Empirical results (see Table 8) indicate that RepoNavigator is the superior agentic design in terms of cost-efficiency. While baselines such as CoSIL and RepoSearcher incur substantial overhead by indexing all repository files, leading to significantly higher average input tokens, our method avoids such exhaustive operations and thus reduces token consumption.
Evolution of Tool-calling Proficiency We track tool-calling performance during training in Fig. 6, where tool rewards consistently improve. A potential side effect of integrating success rates into the GRPO reward function is the emergence of "conservative" agents that inflate success metrics by suppressing tool usage. However, the results in Fig. 6b and Fig. 6c mitigate this concern, showing that successful tool-calling turns increase while overall turns remain stable or rise. We observe a particularly pronounced increase in total turns for the 7B model (Fig. 6b). Unlike the 14B and 32B models, which leverage their stronger reasoning to reach targets with fewer, more precise invocations, the 7B model undergoes a more intensive exploration phase due to its initially lower tool mastery (Fig. 6a).
Aside from the open-source baselines we reproduce in our research, there are other agentic designs for repo-level tasks. PAGENT (Xue et al., 2025) analyzes the failure modes of existing agents and applies CFG creation and exploration to infer the type of a patch. MarsCode (Liu et al., 2024b) is a training-free multi-agent framework comprising a manager, a searcher, a tester, a reproducer, and editors. Trae (Gao et al., 2025) is a training-free workflow composed of patch generation, regression testing, patch pruning, and selection; among these frameworks, Trae ranks highest on the SWE-bench leaderboard. Linxi (Yang et al., 2025c) is a training-free multi-agent framework that additionally treats the repository as a knowledge database, applying retrieval-based methods to extract knowledge from it. These workflows are sophisticated and successfully integrate repo-level tools including language servers, terminals, and search engines. However, their complexity hinders their potential for training. In contrast, we remove unnecessary tools and workflows, preserving only the jump tool (built upon a language server) to build a fully automatic agent. Our design enables efficient training with reinforcement learning.
We present the detailed pseudocode of the RL training process of RepoNavigator in Algorithm 1. For simplicity, we include the entry point (normally a testing function) in the issue $q$.
Algorithm 1 Reinforcement learning process of RepoNavigator using GRPO
Require: Repository $\mathcal{R}$, issue $q$, initial policy $\pi_\theta$, group size $G$
Ensure: Output $\hat{Y}$ maximizing $\mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$
1: $o_0 \leftarrow q$, $\tau \leftarrow \emptyset$
2: while not terminated do
3:   $h_t \leftarrow (q, o_{1:t-1}, a_{1:t-1})$
4:   Optionally generate intermediate reasoning $r_t \sim \pi_\theta(\cdot \mid h_t)$
5:   Sample $x_t \sim \pi_\theta(\cdot \mid h_t)$
6:   if $x_t$ is a tool call then
7:     Invoke the jump tool and append the observation $o_t$ to $\tau$
8:   else
9:     Parse the final locations $\hat{Y}$ from $x_t$ and terminate
10:  end if
11: end while
12: Sample a group of trajectories $\{\tau^{(g)}, \hat{Y}^{(g)}\}_{g=1}^{G}$ using $\pi_\theta$
13: Compute rewards $\{R(\tau^{(g)})\}_{g=1}^{G}$ and group-normalized advantages $\{\hat{A}^{(g)}\}_{g=1}^{G}$
14: for $g = 1, \ldots, G$ do
15:   for each $(h_t, a_t) \in \tau^{(g)}$ do
16:     Assign advantage $\hat{A}_t \leftarrow \hat{A}^{(g)}$
17:   end for
18: end for
19: Estimate the GRPO loss $\mathcal{L}_{\mathrm{GRPO}}(\theta)$ using Eq. 4 and update $\theta$
Groundtruth Retrieval A limitation of our work lies in the extraction of groundtruth locations. We extract modified locations directly from the gold patch in each dataset, which may ignore alternative patches that also resolve the issue; our evaluation metrics do not take these correct alternatives into consideration. However, using gold patches is acceptable when comparing multiple methods: if a method recovers the gold locations (the locations modified by the gold patch), it undoubtedly contributes to resolving the issue, and the results in Table 3 support this claim.
Language Limit Another limitation is that we only evaluate Python repositories in our experiments. Each language (C/C++, Java, etc.) has its own language server, and we have so far only integrated the Python language server. We will integrate more language servers and validate our approach on more programming languages in the future.
In this section, we present the full trajectory of RepoNavigator on astropy__astropy-12907 from SWE-bench Verified in Fig. ??. We apply the default tool-calling prompt template of verl (Shen, 2024) and present an example; the system prompt and user prompt are shown in Fig. ??. Note that we do not impose any process restrictions in our prompt, encouraging RepoNavigator to plan, call tools, and make decisions fully automatically. This is distinct from Agentless (which has a fixed workflow), LocAgent (which predefines a specific step-by-step workflow in its system prompt), and CoSIL and RepoSearcher (which are semi-automatic because forced steps are added to the workflow alongside the automatic multi-turn tool-calling conversations).
The following pages contain the prompt and a case study on astropy__astropy-12907 from SWE-bench Verified.
[system] You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
You may call one or more functions to assist with the user query.
You are provided with function signatures within XML tags: {“type”: “function”, “function”: {“name”: “jump”, “description”: “In the specific file path, a symbol is referred and this tool can find where the tool is defined. For instance, in the first turn, file_path is the entry point of.”, “parameters”: {“properties”: {“symbol”: {“description”: “The symbol whose definition code will be given to the agent.”, “type”: “string”}, “file_path”: {“description”: “The relevant path to the file where the symbol is referred.”, “type”: “string”}}, “required”: [“symbol”, “file_path”], “type”: “object”}}} For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags: <tool_call> {“name”: , “arguments”: } </tool_call> [user] You are given a codebase and an issue, you need to locate the files and functions causing this issue. You can call the tool to check the definition code of a symbol. You can only check the symbol once for each turn. The “file_path” is the relevant path of where the symbol is called, NOT where it is defined! For instance, if “classA.functionB” is what you want to check (which is called in fileA.py), you should directly check “functionB” in “fileA.py”.
The entry file of the code base is:
Your final answer should be all functions that should be modified, such as: relevant/path/to/file1.py::func_name1,relevant/path/to/file2.py::func_name2, …(a series of file::function pairs seperated by comma) Please put your final answer inside \boxed{} only in the last turn. You can only call the tool once each turn. For instance: {“name”: “jump”, “arguments”: {“symbol”: “symbol_to_be_checked”, “file_path”: “file_where_the_symbol_is_used”}} User-turn-0 …(system prompt and problem statement are omitted)… The entry file of the code base is: astropy/modeling/tests/test_separable.py: def test_separable(compound_model, result): assert_allclose(is_separable(compound_model), result[0]) assert_allclose(separability_matrix(compound_model), result [1])
Your final answer should be all functions that should be modified, with the format of: relevant/path/to/file1.py::func_name1,relevant/path/to/file2.py::func_name2,… The “separability_matrix” function is defined in “astropy/modeling/separable.py” Given the issue with nested CompoundModels, it seems the problem might lie in how the function handles nested models. Specifically, the function might not correctly propagate the separability information through nested models.
To address this, we need to check the “_separable” function, which is likely where the separability logic is implemented. Let’s check the “_separable” function. <tool_call> {“name”: “jump”, “arguments”: {“symbol”: “_separable”, “file_path”: “astropy/modeling/separable.py”}} </tool_call> The “_separable” function is defined in “astropy/modeling/separable.py”. This function is responsible for calculating the separability of outputs for a given model. Given the issue with nested CompoundModels, it seems the problem might lie in how the function handles the combination of separability matrices for nested models.
To address this, we need to ensure that the separability information is correctly propagated through nested models. Specifically, we need to check how “_operators” dictionary handles the combination of separability matrices.
Let’s check the “_operators” dictionary. <tool_call> {“name”: “jump”, “arguments”: {“symbol”: “_operators”, “file_path”: “astropy/modeling/separable.py”}} </tool_call> User-turn-6 <tool_response> The definition of symbol “_operators” is: _operators = {"&": _cstack, “|”: _cdot, “+”: _arith_oper, “-”: _arith_oper, " * “: _arith_oper, “/”: _arith_oper, " ** “: _arith_oper} It is defined in: astropy/modeling/separable.py You can check other symbols in previous files, or you can give the final answer.