IDE-Bench: Evaluating Large Language Models as IDE Agents on Real-World Software Engineering Tasks

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

IDE-Bench is a comprehensive framework for evaluating AI IDE agents on real-world software engineering tasks through an IDE-native tool interface. We present a Dockerized test harness that goes beyond raw terminal execution, granting models a structured tool ecosystem representative of AI-native IDEs such as Cursor and Windsurf. By providing high-level abstractions for codebase search, structured file editing, and full-stack application testing, IDE-Bench evaluates an agent’s ability to act as a true engineering collaborator. To prevent training-data contamination, we created 80 evaluation tasks across eight never-published repositories spanning C/C++, Java, and MERN stacks, covering production scenarios (feature implementation, bug fixing, refactoring, and performance optimization) that mirror daily developer workflows in private codebases. Our benchmark is the first to systematically correlate agent-reported intent with successful project-level modifications in a multi-language, full-stack environment on completely uncontaminated code. We release IDE-Bench and a public leaderboard at https://ide-bench.com.


💡 Research Summary

IDE‑Bench introduces a comprehensive framework for evaluating large language models (LLMs) that act as IDE agents on realistic software engineering tasks. Unlike prior benchmarks such as SWE‑Bench, SWE‑Bench Verified, or Terminal‑Bench, IDE‑Bench is built around an “IDE‑native” tool interface that mirrors the capabilities of modern AI‑enhanced IDEs like Cursor and Windsurf. The benchmark provides a Docker‑based test harness where each model can invoke 17 distinct functions—ranging from file system operations (read_file, edit_file, list_dir) and codebase search to execution of terminal commands, API calls, and MongoDB queries. Each function call must include an explanatory parameter, encouraging explicit reasoning and enabling fine‑grained trajectory analysis.
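The mandatory explanatory parameter can be pictured as part of each tool's JSON schema. The sketch below is hypothetical (the paper's actual tool names and schema fields may differ): it shows one tool specification in the style described, with `explanation` marked required so every call carries the agent's stated reasoning.

```python
# Hypothetical sketch of one IDE-native tool specification in the style the
# summary describes. The schema shape follows common LLM tool-calling APIs;
# field names other than "read_file" and "explanation" are illustrative.
READ_FILE_TOOL = {
    "name": "read_file",
    "description": "Read a file from the workspace.",
    "parameters": {
        "type": "object",
        "properties": {
            "path": {
                "type": "string",
                "description": "Workspace-relative file path.",
            },
            "explanation": {
                "type": "string",
                "description": "One sentence explaining why this call is needed.",
            },
        },
        "required": ["path", "explanation"],
    },
}

def validate_call(tool_spec: dict, args: dict) -> bool:
    """Reject calls that omit any required parameter, e.g. the explanation."""
    required = tool_spec["parameters"]["required"]
    return all(key in args for key in required)
```

Requiring the explanation at the schema level is what makes the fine-grained trajectory analysis possible: every logged call already contains the model's stated intent.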

To eliminate training‑data contamination, the authors curated eight completely unpublished repositories spanning C/C++, Java, TypeScript, Python, and full‑stack MERN applications. Each repository contains ten tasks that reflect real‑world developer workflows: feature implementation from underspecified specifications, multi‑file bug fixing, backward‑compatible refactoring, and performance optimization under production constraints. Every task directory includes a human‑written description (task description.txt), a reference patch (task diff.txt), and an automated test suite (tests.py). The reference diff is hidden from the model at evaluation time, forcing the agent to explore, edit, and test the code base autonomously.

The evaluation pipeline proceeds in three stages. First, a fresh Ubuntu 24.04 Docker container is instantiated from the repository’s Dockerfile, ensuring reproducibility and isolation. Git is configured for precise change tracking. Second, the model is launched via a LiteLLM‑based “Gladiator” agent with a comprehensive system prompt that lists all tool specifications and the task goal. The agent may iteratively call any of the 17 tools up to a 100‑step limit; all calls, responses, and explanations are logged for later analysis. Security constraints prevent the agent from reading test files or the hidden golden solution. Third, after the agent finishes (or the iteration limit is reached), the benchmark runs ./run_tests.sh, compares the resulting git diff against the hidden reference, and computes a suite of metrics: pass@1, pass@5 (the fraction of tasks solved in at least one of one or five independent attempts), per‑test pass rates, token consumption, an efficiency score (pass@5 ÷ tokens × 10³), and tool‑usage statistics.
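Under the definition given above (a task counts if at least one of the first k attempts passes), pass@1 and pass@5 can be computed directly from the per-attempt results. This is a minimal sketch, not the benchmark's actual scoring code:

```python
def pass_at_k(results: list[list[bool]], k: int) -> float:
    """Fraction of tasks solved in at least one of the first k attempts.
    results[t][a] is True iff attempt a on task t passed the full test suite."""
    solved = sum(any(task_attempts[:k]) for task_attempts in results)
    return solved / len(results)

# Toy data: 4 tasks x 5 attempts (hypothetical, not from the paper).
results = [
    [True,  False, False, False, False],  # solved on the first try
    [False, False, True,  False, False],  # solved only on the third try
    [False, False, False, False, False],  # never solved
    [True,  True,  True,  True,  True ],  # solved every time
]
print(pass_at_k(results, 1))  # 0.5
print(pass_at_k(results, 5))  # 0.75
```

Note how a task solved only on a later attempt raises pass@5 but not pass@1, which is why the two metrics rank models differently in the results below.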

Experiments covered 15 models (including GPT‑5.2, Claude Sonnet 4.5, Claude Haiku 4.5, Claude Opus 4.5, GPT‑5.1 Codex Max, Gemini 3 Pro Preview, Qwen3 Max, Qwen3 Coder, DeepSeek V3.2, Grok 4.1 Fast, among others) across 6,000 runs (80 tasks × 5 attempts × 15 models). GPT‑5.2 achieved the highest overall performance with 95 % pass@5, while the Claude family and GPT‑5.1 Codex Max clustered in the 85–89 % pass@5 range. Gemini 3 Pro Preview reached 80 % pass@5, Qwen3 models hovered around 75 %, and DeepSeek V3.2 attained 71 % pass@5. Lower‑tier models struggled to exceed a 50 % resolution rate, highlighting a clear gap between models capable of functioning as autonomous IDE agents and those that cannot.

First‑attempt success (pass@1) revealed a similar hierarchy: Claude Sonnet 4.5 led with 87.5 % pass@1, followed closely by GPT‑5.2 (85 %) and Claude Opus 4.5 (83.75 %). This metric is especially relevant for production deployments where API latency or cost limits the number of retries.

Beyond binary pass@k, the authors performed per‑test analyses and uncovered a “near‑miss” phenomenon: several models correctly implemented core functionality but failed a small subset of tests due to output formatting or edge‑case handling. For example, on the Event Callback System task‑4, Claude Sonnet 4.5, Gemini 3 Pro Preview, and Claude Opus 4.5 each achieved a 91.7 % test pass rate yet scored 0 % pass@5 because a single formatting error caused the overall task to be marked as failed. Similar patterns appeared in the Cross‑Lingual Document Translator task. These findings suggest that in real development environments, a high overall pass rate may still mask brittle issues that require manual correction.
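The near-miss gap falls out of the scoring rule: a task is binary pass/fail over its whole test suite, so one failing test zeroes the task even at a high per-test rate. A minimal sketch of that distinction:

```python
def task_score(test_results: list[bool]) -> tuple[float, bool]:
    """Return (per-test pass rate, binary task pass).
    A task counts as solved only when every test passes, so a single
    failing formatting test fails the whole task."""
    rate = sum(test_results) / len(test_results)
    return rate, all(test_results)

# 11 of 12 tests pass: a 91.7% per-test rate, yet the task is marked failed,
# mirroring the Event Callback System task-4 pattern described above.
rate, passed = task_score([True] * 11 + [False])
print(f"{rate:.1%}", passed)  # 91.7% False
```

Reporting both numbers, as IDE-Bench does, is what surfaces these near-misses that pass@k alone would hide.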

Token efficiency varied widely. Grok 4.1 Fast was the most token‑efficient model (67.5 % pass@5 with 182 k tokens per successful task, efficiency = 0.37), while Claude Opus 4.5, despite strong coverage (86.25 % pass@5), consumed 1.354 M tokens per success. GPT‑5.1 Codex Max achieved a favorable balance (85 % pass@5 with 282 k tokens per success, efficiency = 0.30). DeepSeek V3.2 and Gemini 3 Pro Preview required over 1 M tokens per success, indicating a “slow‑but‑thorough” refinement style that generates many intermediate patches without converging quickly. The authors propose a two‑tier deployment architecture: a fast, token‑efficient model attempts the task first; if it fails, a more thorough, higher‑cost model takes over.
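The reported efficiency numbers are reproduced by the stated formula (pass@5 ÷ tokens × 10³) if pass@5 is expressed in percent and tokens per success in raw token counts, an interpretation inferred here from the figures rather than stated explicitly:

```python
def efficiency(pass_at_5_pct: float, tokens_per_success: float) -> float:
    """Efficiency score as the summary defines it: pass@5 / tokens * 10^3.
    Matching the reported values requires pass@5 in percent (e.g. 67.5)
    and tokens as a raw count (e.g. 182_000); this unit choice is an
    assumption inferred from the published numbers."""
    return pass_at_5_pct / tokens_per_success * 1e3

print(f"{efficiency(67.5, 182_000):.2f}")  # 0.37 (Grok 4.1 Fast)
print(f"{efficiency(85.0, 282_000):.2f}")  # 0.30 (GPT-5.1 Codex Max)
```

Both values match the figures quoted above, which supports the percent/raw-token reading of the formula.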

In summary, IDE‑Bench provides the first systematic, tool‑enabled benchmark that evaluates LLMs as true IDE collaborators across multi‑language, full‑stack projects while safeguarding against data contamination. It captures not only whether a model can produce a correct patch, but also how efficiently it uses IDE tools, how many iterations it needs, and how robust its output is to formatting and edge‑case constraints. The benchmark’s rich set of metrics and open leaderboard (https://ide-bench.com) give researchers and practitioners a nuanced view of model suitability for real‑world coding assistance, and lay the groundwork for future extensions such as richer human‑AI interaction studies, additional technology stacks, and more complex multi‑agent collaboration scenarios.

