Towards a Benchmark for Dependency Decision-Making
AI coding agents increasingly modify real software repositories and make dependency decisions, including adding, removing, or updating third-party packages. These choices can materially affect security posture and maintenance burden, yet repository-level evaluations largely emphasize test passing and executability without explicitly scoring whether systems (i) reuse existing dependencies, (ii) avoid unnecessary additions, or (iii) select versions that satisfy security and policy constraints. We propose DepDec-Bench, a benchmark for evaluating dependency decision-making beyond functional correctness. To ground DepDec-Bench in real-world behavior, we conduct a preliminary study of 117,062 dependency changes from agent- and human-authored pull requests across seven ecosystems. We show that coding agents frequently make dependency decisions with security consequences that remain invisible to test-focused evaluation: agents select PR-time known-vulnerable versions (2.46%) and exhibit net-negative security impact overall (net impact -98 vs. +1,316 for humans). These observations inform DepDec-Bench task families and metrics that evaluate safe version selection, reuse discipline, and restraint against dependency bloat alongside test passing.
💡 Research Summary
The paper addresses a growing blind spot in the evaluation of AI‑driven coding agents: their decisions about third‑party dependencies. While modern software heavily relies on packages from public registries such as NPM, PyPI, and Maven Central, these dependencies are a primary attack surface and a source of long‑term maintenance cost. Existing repository‑level benchmarks focus almost exclusively on functional correctness—compilation success, passing tests, or execution—treating changes to manifest files as incidental. The authors argue that dependency decision‑making (reusing existing libraries, avoiding unnecessary additions, and selecting safe versions) should be treated as a first‑class capability and measured explicitly.
To ground their proposal, the authors conduct an empirical study on the AIDev‑pop dataset, which contains 33,596 agent‑authored and 6,618 human‑authored pull requests (PRs) across 2,807 popular GitHub repositories spanning seven ecosystems. They extract 117,062 dependency changes (additions, removals, and version updates) from manifest files. The analysis shows that agents make dependency decisions in 45% of their PRs, with 25.5% of changes being version updates—significantly higher than the 15.8% update rate for humans. More importantly, agents introduce known‑vulnerable versions at a higher rate (2.46% vs. 1.64%). In 86.58% of those vulnerable selections, a patched, safe version was already available at PR time, indicating that agents often ignore readily fixable security issues.
The security impact is quantified using a “net impact” metric (vulnerabilities fixed minus vulnerabilities introduced, so negative values indicate net harm). Agent‑authored PRs have a net impact of −98 (i.e., they introduce more new vulnerabilities than they fix), whereas human PRs achieve a net positive impact of +1,316. Moreover, 36.8% of agent‑introduced vulnerable versions require a major version bump to reach a safe release, compared with only 12.9% for humans, suggesting higher future remediation effort for agent‑generated code.
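The net-impact metric amounts to a simple signed count; a minimal sketch is shown below. The function name and the per-cohort counts in the usage lines are illustrative assumptions (the paper reports only the aggregate net values, so the counts here are hypothetical numbers chosen to reproduce them).

```python
def net_impact(n_fixed: int, n_introduced: int) -> int:
    """Net security impact across a set of PRs: vulnerabilities fixed
    minus vulnerabilities introduced. Negative values mean the PRs
    introduced more new vulnerabilities than they resolved."""
    return n_fixed - n_introduced

# Hypothetical counts chosen only to reproduce the reported aggregates:
# agents net -98, humans net +1,316.
assert net_impact(100, 198) == -98     # agent cohort (illustrative counts)
assert net_impact(1500, 184) == 1316   # human cohort (illustrative counts)
```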
Motivated by these findings, the authors propose DepDec‑Bench, a benchmark specifically designed to evaluate dependency decision‑making. DepDec‑Bench defines two evaluation tracks: (1) a policy‑specified track where the prompt includes explicit dependency constraints (allow‑lists, deny‑lists, version policies), and (2) a policy‑unspecified track where agents must infer sensible decisions without explicit guidance. Four task families are introduced:
- Reuse‑available – an appropriate library already exists; the solution should reuse it rather than add a new package.
- Justified‑add – a new capability is required; adding a dependency is permissible but must respect an allow‑list and version policy.
- Avoid‑unnecessary – adding a dependency is possible but unjustified; the solution should rely on the standard library or existing code.
- Policy‑safe‑selection – a version change is needed; the agent must avoid PR‑time known‑vulnerable or deny‑listed releases and prefer safe alternatives available at the reference date.
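The policy‑safe‑selection check in the last task family can be sketched as follows. This is not the benchmark's implementation: the data shapes (version triples, and sets of versions known vulnerable or deny‑listed at the PR's reference date) are simplifying assumptions for illustration.

```python
def select_safe_version(candidates, vulnerable, deny_listed):
    """Return the highest candidate version that is neither known-vulnerable
    at PR time nor deny-listed, or None if no safe release exists.
    Versions are (major, minor, patch) tuples so tuple comparison orders them."""
    safe = [v for v in candidates if v not in vulnerable and v not in deny_listed]
    return max(safe) if safe else None

# Illustrative scenario: 2.0.1 is the latest safe release at the reference date.
available = [(1, 9, 0), (2, 0, 0), (2, 0, 1)]
bad = {(2, 0, 0)}    # known-vulnerable at PR time (hypothetical advisory)
deny = {(1, 9, 0)}   # deny-listed by policy (hypothetical rule)
assert select_safe_version(available, bad, deny) == (2, 0, 1)
```

An agent that picks 2.0.0 here would pass a purely functional test suite while failing the PR-time safety check, which is exactly the gap the benchmark targets.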
Each benchmark instance provides a pinned repository snapshot, a task prompt (optionally with failing test output), and a test suite. Agents produce a unified patch that may modify source code and manifest files. Evaluation metrics include:
- PR‑time safety & policy compliance – avoidance of vulnerable or deny‑listed versions when safe alternatives exist.
- Decision discipline – degree of reuse of existing dependencies versus unnecessary additions.
- Remediation disruption – quantifying the effort required to remediate unsafe selections (e.g., major version jumps).
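The remediation‑disruption metric could, for example, test whether reaching the nearest safe release crosses a major version boundary, the signal the study reports at 36.8% for agents versus 12.9% for humans. The helper below is a sketch under that assumption, again using (major, minor, patch) tuples.

```python
def requires_major_bump(current, nearest_safe):
    """True if moving from the selected (vulnerable) version to the nearest
    safe release crosses a major version boundary, which under semantic
    versioning typically implies breaking-change review during remediation."""
    return nearest_safe[0] > current[0]

# A fix only available in the next major line is disruptive to adopt;
# a patch release in the same line is not.
assert requires_major_bump((1, 4, 2), (2, 0, 0)) is True
assert requires_major_bump((1, 4, 2), (1, 4, 3)) is False
```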
The authors outline a roadmap for constructing the benchmark: initial manual labeling of safe/unsafe decisions, definition of policy rules, integration of vulnerability databases (e.g., OSV, NVD) for PR‑time checks, and eventual automation of scoring pipelines. They emphasize that some tasks will be inherently subjective; therefore, DepDec‑Bench will support both strict (objective) and relaxed (subjective) scoring modes.
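OSV exposes a query API (POST to `https://api.osv.dev/v1/query`) that returns advisories affecting a given package version. The sketch below only constructs the request payload; restricting the returned advisories to those published before the PR's reference date (the "PR-time" part of the check) would be applied to the response and is not shown.

```python
def osv_query_payload(ecosystem: str, name: str, version: str) -> dict:
    """Build a request body for OSV's /v1/query endpoint. The response
    lists advisories affecting this package version; filtering them to
    those published before the PR's reference date is a separate step."""
    return {
        "version": version,
        "package": {"ecosystem": ecosystem, "name": name},
    }

# Example: check a pinned PyPI package version.
payload = osv_query_payload("PyPI", "requests", "2.19.1")
assert payload["package"] == {"ecosystem": "PyPI", "name": "requests"}
```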
In conclusion, the paper demonstrates that AI coding agents frequently make dependency decisions with hidden security and maintenance consequences, and that current evaluation practices fail to capture these effects. DepDec‑Bench fills this gap by providing a systematic, reproducible framework to assess agents not only on functional correctness but also on the quality of their dependency choices, encouraging the development of more security‑aware and maintenance‑conscious AI coding tools.