ContextBench: A Benchmark for Context Retrieval in Coding Agents

Notice: This research summary and analysis were automatically generated using AI technology. For accuracy, please refer to the original arXiv source.

LLM-based coding agents have shown strong performance on automated issue resolution benchmarks, yet existing evaluations largely focus on final task success, providing limited insight into how agents retrieve and use code context during problem solving. We introduce ContextBench, a process-oriented evaluation of context retrieval in coding agents. ContextBench consists of 1,136 issue-resolution tasks from 66 repositories across eight programming languages, each augmented with human-annotated gold contexts. We further implement an automated evaluation framework that tracks agent trajectories and measures context recall, precision, and efficiency throughout issue resolution. Using ContextBench, we evaluate four frontier LLMs and five coding agents. Our results show that sophisticated agent scaffolding yields only marginal gains in context retrieval (“The Bitter Lesson” of coding agents), LLMs consistently favor recall over precision, and substantial gaps exist between explored and utilized context. ContextBench augments existing end-to-end benchmarks with intermediate gold-context metrics that unbox the issue-resolution process. These contexts offer valuable intermediate signals for guiding LLM reasoning in software tasks.


💡 Research Summary

ContextBench is introduced as a process‑oriented benchmark that evaluates how large‑language‑model (LLM) based coding agents retrieve and use code context while solving real‑world software issues. Existing benchmarks such as SWE‑bench and SWE‑bench Pro focus solely on final task success (e.g., Pass@k), ignoring the intermediate steps where agents must locate relevant files, functions, or lines in a repository. To fill this gap, the authors curated 1,136 issue‑resolution tasks drawn from 66 open‑source repositories spanning eight programming languages (Python, Java, JavaScript, TypeScript, Go, Rust, C, C++). For each task, expert developers produced a “gold context” – a compact, verified set of code artifacts that are necessary and sufficient to fix the issue.

The construction pipeline consists of three stages. First, tasks are aggregated from four public benchmarks (SWE‑bench Verified, Multi‑SWE‑bench, SWE‑PolyBench PB500, SWE‑bench Pro) and deduplicated using rule‑based metadata matching and embedding‑based similarity detection, yielding 3,100 unique tasks. Second, difficulty metrics – agent solvability, edit scope, and edit dispersion – are computed to rank tasks; the authors then manually prune the list to 1,136 challenging yet meaningful instances. Third, a human‑in‑the‑loop annotation process iteratively traces code dependencies from the ground‑truth patches, validates contexts by prompting a state‑of‑the‑art LLM (GPT‑5) to generate patches using only the context, and refines the annotations for compactness and inter‑annotator agreement. The resulting gold contexts cover 522,115 lines, 23,116 classes/functions, and 4,548 files.
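The embedding-based similarity step in the first stage can be illustrated with a minimal sketch. This is not the authors' implementation; the `deduplicate` function, the 0.95 threshold, and the toy two-dimensional embeddings are assumptions for illustration only:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def deduplicate(tasks, threshold=0.95):
    """Greedy dedup: keep a task only if no already-kept task's
    embedding exceeds `threshold` similarity (illustrative stand-in
    for the paper's rule-based + embedding pipeline)."""
    kept = []
    for task in tasks:
        if all(cosine(task["emb"], k["emb"]) < threshold for k in kept):
            kept.append(task)
    return kept

tasks = [
    {"id": "a", "emb": [1.0, 0.0]},
    {"id": "b", "emb": [0.99, 0.01]},  # near-duplicate of "a"
    {"id": "c", "emb": [0.0, 1.0]},
]
unique = deduplicate(tasks)
print([t["id"] for t in unique])  # -> ['a', 'c']
```

In the real pipeline the embeddings would come from an issue-text encoder and the rule-based metadata match (same repository, same commit) would run first; the greedy pass above only conveys the shape of the filtering.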

To evaluate agents, the authors instrument the agents’ execution traces, recording every file, AST block, and line the agent inspects. Using Tree‑sitter to parse repositories, they map both the agent‑retrieved regions and the gold contexts onto a shared coordinate system. Recall, precision, and F1 are then computed at three granularities (file, block, line) via interval overlap. This framework enables dynamic, stage‑wise assessment of context retrieval efficiency and effectiveness, complementing traditional end‑to‑end success metrics.
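The line-granularity variant of this interval-overlap scoring can be sketched as follows. The function name, the interval representation, and the example ranges are assumptions for illustration; the paper's framework additionally operates at file and AST-block granularity:

```python
def to_lines(intervals):
    """Expand (start, end) inclusive line intervals into a set of line numbers."""
    lines = set()
    for start, end in intervals:
        lines.update(range(start, end + 1))
    return lines

def context_metrics(gold, retrieved):
    """Line-granularity recall/precision/F1 via interval overlap.
    `gold` and `retrieved` map file path -> list of (start, end) intervals."""
    gold_total = retrieved_total = overlap = 0
    for path in set(gold) | set(retrieved):
        g = to_lines(gold.get(path, []))
        r = to_lines(retrieved.get(path, []))
        gold_total += len(g)
        retrieved_total += len(r)
        overlap += len(g & r)
    recall = overlap / gold_total if gold_total else 0.0
    precision = overlap / retrieved_total if retrieved_total else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return recall, precision, f1

# Hypothetical example: the agent reads part of the gold region plus
# an unrelated 50-line file, so recall is moderate and precision is low.
gold = {"src/app.py": [(10, 20)]}
retrieved = {"src/app.py": [(5, 15)], "src/util.py": [(1, 50)]}
recall, precision, f1 = context_metrics(gold, retrieved)
```

Mapping both sides onto Tree-sitter-derived coordinates is what makes these interval comparisons well defined across the eight languages; the sketch assumes that normalization has already happened.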

Four frontier LLMs (GPT‑5, Claude Sonnet 4.5, Gemini 2.5 Pro, Devstral 2) and five coding agents (mini‑SWE‑agent, SWE‑agent, OpenHands, Agentless, Prometheus) are benchmarked. Key findings include: (1) sophisticated retrieval scaffolding (multi‑step prompting, external search integration) does not guarantee superior context retrieval; simple baselines often perform comparably, echoing the “Bitter Lesson” that raw compute outweighs engineered tricks. (2) All LLMs prioritize recall, pulling in large swaths of code to maximize coverage, which leads to low precision and higher token consumption. (3) Models that strike a balance between recall frequency and granularity achieve higher Pass@1 scores while reducing cost, indicating that efficient context selection is beneficial. (4) A substantial gap exists between retrieved and actually utilized context: agents may explore gold‑relevant code but fail to incorporate it into the final patch, highlighting context consolidation as a critical bottleneck.
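Finding (4), the gap between explored and utilized context, can be made concrete with a small sketch. The representation of contexts as sets of (file, line) pairs and the `utilization_gap` function are assumptions for illustration, not the paper's metric definition:

```python
def utilization_gap(explored, patched, gold):
    """Fraction of gold lines the agent explored but never edited.
    All arguments are sets of (file, line) pairs: `explored` is every
    line the agent inspected, `patched` is every line touched by its
    final patch, `gold` is the annotated gold context."""
    explored_gold = explored & gold
    utilized_gold = patched & gold
    unused = explored_gold - utilized_gold
    return len(unused) / len(explored_gold) if explored_gold else 0.0

# Hypothetical trajectory: the agent reads all ten gold lines of a.py
# but its patch only modifies three of them.
gold = {("a.py", n) for n in range(10, 20)}
explored = {("a.py", n) for n in range(5, 25)}
patched = {("a.py", n) for n in range(10, 13)}
gap = utilization_gap(explored, patched, gold)  # 0.7: 70% explored-but-unused
```

A large value of this kind of gap is what the authors call the context-consolidation bottleneck: the relevant code was found, but not carried through into the fix.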

Overall, ContextBench provides the first large‑scale, human‑annotated dataset and evaluation suite for measuring intermediate context‑retrieval behavior of coding agents. By exposing how agents search, filter, and apply code context, it offers valuable signals for designing better prompting strategies, memory mechanisms, and retrieval‑generation feedback loops. The benchmark is publicly released (https://contextbench.github.io/) and includes a “Lite” subset of 500 tasks for rapid prototyping. Future work can extend the benchmark to more languages, larger repositories, and richer tooling (e.g., dynamic analysis) to further close the gap between agent reasoning and reliable software development.

