FeatureBench: Benchmarking Agentic Coding for Complex Feature Development

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Agents powered by large language models (LLMs) are increasingly adopted in the software industry, contributing code as collaborators or even autonomous developers. As their presence grows, it becomes important to assess the current boundaries of their coding abilities. Existing agentic coding benchmarks, however, cover a limited task scope, e.g., bug fixing within a single pull request (PR), and often rely on non-executable evaluations or lack an automated approach for continually updating evaluation coverage. To address these issues, we propose FeatureBench, a benchmark designed to evaluate agentic coding performance in end-to-end, feature-oriented software development. FeatureBench incorporates an execution-based evaluation protocol and a scalable test-driven method that automatically derives tasks from code repositories with minimal human effort. By tracing from unit tests along a dependency graph, our approach can identify feature-level coding tasks spanning multiple commits and PRs scattered across the development timeline, while ensuring that other features continue to function after the separation. Using this framework, we curated 200 challenging evaluation tasks and 3825 executable environments from 24 open-source repositories in the first version of our benchmark. Empirical evaluation reveals that state-of-the-art agentic models such as Claude 4.5 Opus, which achieves a 74.4% resolved rate on SWE-bench, succeed on only 11.0% of tasks, opening new opportunities for advancing agentic coding. Moreover, benefiting from our automated task collection toolkit, FeatureBench can be easily scaled and updated over time to mitigate data leakage. The inherent verifiability of the constructed environments also makes our method potentially valuable for agent training.


💡 Research Summary

The paper “FeatureBench: Benchmarking Agentic Coding for Complex Feature Development” introduces a novel benchmark designed to evaluate the capabilities of Large Language Model (LLM)-powered coding agents in complex, feature-oriented software development scenarios. The authors identify significant limitations in existing agentic coding benchmarks like SWE-bench, GitTaskBench, and PaperBench. These limitations include a narrow focus on bug-fixing within single pull requests (PRs), reliance on non-executable or manually crafted evaluations, and a lack of automated, scalable methods for task collection, which hinders continuous updates and expansion of evaluation coverage.

To address these gaps, FeatureBench is proposed with two core innovations: an execution-based evaluation protocol and a scalable, test-driven instance collection toolkit. The benchmark tasks require agents to develop callable features based on high-level descriptions and explicit interface definitions (including import paths and function signatures), either by extending an existing codebase (Level 1 difficulty) or implementing from scratch (Level 2 difficulty). Success is determined automatically by executing associated unit tests, categorized as Fail-to-Pass (F2P) and Pass-to-Pass (P2P) tests, following the established protocol of SWE-bench.
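The resolution criterion described above can be sketched in a few lines. This is an illustrative reconstruction of the SWE-bench-style check, not the paper's actual harness: a task counts as resolved only if every Fail-to-Pass (F2P) test now passes and every Pass-to-Pass (P2P) test still passes after the agent's change. The test identifiers and the `results` mapping are hypothetical.

```python
def is_resolved(results: dict[str, bool], f2p: list[str], p2p: list[str]) -> bool:
    """Resolved iff all F2P tests flipped to passing and no P2P test regressed.

    `results` maps a test id to True (passed) or False (failed) after the
    agent's patch is applied. A test missing from `results` counts as failed.
    """
    feature_ok = all(results.get(t, False) for t in f2p)       # feature tests now pass
    no_regressions = all(results.get(t, False) for t in p2p)   # existing tests still pass
    return feature_ok and no_regressions


# Example: one feature test still fails, so the task is not resolved.
outcome = is_resolved(
    {"test_new_feature_a": True, "test_new_feature_b": False, "test_existing": True},
    f2p=["test_new_feature_a", "test_new_feature_b"],
    p2p=["test_existing"],
)
print(outcome)  # False
```

In practice the harness would populate `results` by running pytest in the task's Docker environment and parsing the report, but the pass/fail logic reduces to this conjunction.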

The automated collection pipeline is the technical cornerstone of FeatureBench. It starts with a Python repository, sets up its environment using Docker, and identifies valid test files via pytest. For each benchmark instance, it selects F2P tests (which fail before the feature is implemented) and P2P tests (which should pass before and after). The system then performs dynamic tracing during test execution to construct a runtime object dependency graph. Using graph traversal algorithms, it isolates the exact code implementing the target feature while rigorously verifying the integrity of the rest of the codebase through a post-verification step. This process automatically generates the “pre-solved” codebase (without the feature) and the corresponding “gold patch” (the feature implementation), along with a synthesized problem statement. This methodology is independent of historical commit or PR trajectories, allowing for the creation of realistic feature-development tasks that often span multiple, scattered contributions.
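The two core mechanics of the pipeline, dynamic tracing during test execution and reachability traversal over the resulting dependency graph, can be illustrated with a toy sketch. This is a simplified assumption of how such tracing might work: it records function names executed under a test via `sys.settrace` and then collects everything reachable from them in a hand-built graph. The real toolkit builds a runtime *object* dependency graph and runs a post-verification step over the remaining codebase, both of which this sketch omits.

```python
import sys
from collections import deque


def trace_called_functions(fn) -> set[str]:
    """Run `fn` under a trace hook and record every function it calls."""
    called: set[str] = set()

    def tracer(frame, event, arg):
        if event == "call":
            called.add(frame.f_code.co_name)
        return tracer

    sys.settrace(tracer)
    try:
        fn()
    finally:
        sys.settrace(None)  # always detach the hook
    return called


def reachable(graph: dict[str, list[str]], roots: set[str]) -> set[str]:
    """BFS over a dependency graph: everything the roots transitively depend on."""
    seen, queue = set(roots), deque(roots)
    while queue:
        for dep in graph.get(queue.popleft(), []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


# Toy "feature" code exercised by a hypothetical F2P test.
def helper():
    return 1

def feature():
    return helper() + 1

def test_feature():
    assert feature() == 2


calls = trace_called_functions(test_feature)
deps = {"test_feature": ["feature"], "feature": ["helper"]}
print(sorted(reachable(deps, calls)))  # typically ['feature', 'helper', 'test_feature']
```

The traced call set seeds the traversal, and the reachable closure approximates "the code implementing the target feature"; everything outside that closure would stay in the pre-solved codebase and be re-verified with the P2P tests.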

Using this toolkit, the authors curated the first version of FeatureBench, comprising 200 challenging evaluation tasks and 3825 executable environments derived from 24 popular open-source Python repositories (e.g., from Hugging Face’s Transformers library). The tasks were created between May 2022 and September 2025, and the pipeline supports continual updates with post-training-date tasks to mitigate data leakage concerns.

Empirical evaluations on FeatureBench reveal a substantial gap in current agent capabilities. While state-of-the-art models like Claude 4.5 Opus achieve a 74.4% resolved rate on SWE-bench, their performance plummets to only 11.0% on FeatureBench. Similarly, GPT-5.1-Codex (medium reasoning) resolves only 12.5% of tasks. This stark contrast highlights that contemporary coding agents, while proficient at localized bug fixes, struggle significantly with the broader planning, integration, and implementation challenges inherent in end-to-end feature development. The benchmark thus opens new avenues for research and advancement in agentic coding. The paper concludes by emphasizing FeatureBench’s role in providing a more comprehensive, scalable, and automatically updatable testbed for pushing the boundaries of what LLM-powered agents can achieve in realistic software engineering contexts.

