DevOps-Gym: Benchmarking AI Agents in Software DevOps Cycle
Despite demonstrating extraordinary capabilities in code generation and software issue resolution, AI agents' abilities across the full software DevOps cycle remain unknown. Unlike pure code generation, handling the DevOps cycle of real-world software, spanning development, deployment, and management, requires analyzing large-scale projects, understanding dynamic program behaviors, leveraging domain-specific tools, and making sequential decisions. Existing benchmarks, however, focus on isolated problems and lack the environments and tool interfaces needed for DevOps. We introduce DevOps-Gym, the first end-to-end benchmark for evaluating AI agents across core DevOps workflows: build and configuration, monitoring, issue resolving, and test generation. DevOps-Gym includes 700+ real-world tasks collected from 30+ projects in Java and Go. We develop a semi-automated data-collection mechanism backed by rigorous, non-trivial expert effort to ensure task coverage and quality. Our evaluation of state-of-the-art models and agents reveals fundamental limitations: they struggle with issue resolving and test generation in Java and Go, and cannot yet handle new task types such as monitoring and build and configuration. These results highlight the need for further research on automating the full DevOps cycle with AI agents.
💡 Research Summary
The paper introduces DevOps‑Gym, the first end‑to‑end benchmark designed to evaluate AI agents across the full software DevOps lifecycle. While large language models (LLMs) have shown impressive abilities in code generation and issue fixing, their competence in the operational phases—building, configuring, monitoring, and validating software—remains unclear. Existing benchmarks focus on isolated tasks (e.g., code synthesis, unit‑test generation) and lack realistic environments, tool‑calling interfaces, and multi‑step planning requirements that are essential for real‑world DevOps work.
DevOps‑Gym addresses this gap by collecting more than 700 real‑world tasks from over 30 open‑source projects written in Java and Go, languages that involve non‑trivial build systems, compilation steps, and mature monitoring toolchains. The benchmark defines four core stages: (1) Build & Configuration, where agents must invoke tools such as Maven, Gradle, npm, or Go modules to compile code, resolve dependency conflicts, and adjust build scripts; (2) Monitoring, which requires agents to run system‑level utilities (top, iostat, ps, pprof, etc.) to detect performance bottlenecks, memory leaks, CPU saturation, or file‑handle exhaustion; (3) Issue Resolving, where agents diagnose the root cause of the observed anomaly, edit source code, and apply patches; and (4) Test Generation, in which agents automatically produce regression tests that validate the fix.
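As a concrete taste of the Monitoring stage's toolchain, the sketch below uses Go's standard runtime/pprof package to capture a heap profile programmatically. The `captureHeapProfile` helper is a hypothetical name invented here for illustration, not part of the benchmark; in practice an agent would feed such a profile to `go tool pprof` to hunt for memory leaks.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
)

// captureHeapProfile illustrates the kind of runtime inspection the
// Monitoring stage calls for: it forces a garbage collection so the heap
// statistics are current, then serializes a heap profile that a tool such
// as `go tool pprof` could analyze for leaks.
func captureHeapProfile() ([]byte, error) {
	runtime.GC() // get up-to-date heap statistics before profiling
	var buf bytes.Buffer
	if err := pprof.WriteHeapProfile(&buf); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	profile, err := captureHeapProfile()
	if err != nil {
		panic(err)
	}
	fmt.Printf("captured heap profile: %d bytes\n", len(profile))
}
```

An agent diagnosing a leak would typically repeat this capture at two points in time and diff the allocation sites, which is one reason the paper treats monitoring as a multi-step, stateful task rather than a single tool call.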
To test agents’ ability to handle complete pipelines, the authors also craft 18 end‑to‑end tasks that chain all four stages sequentially, forcing agents to retain context from the initial build through monitoring, debugging, and final test validation. The benchmark construction involved a semi‑automated pipeline: (i) manual analysis of >1,000 GitHub issues to categorize common failure patterns; (ii) synthesis of realistic failure scenarios where necessary; (iii) rigorous de‑duplication and contamination checks using prefix‑completion analysis to avoid leakage from pre‑training corpora; (iv) reconstruction of the exact runtime environment for each task, often requiring >10 hours of expert effort per monitoring or build issue; and (v) conversion of tasks into the TerminalBench format, providing a standardized command‑line tool interface and a suite of metrics (success/failure, step count, token usage, execution time).
The evaluation covered five state‑of‑the‑art LLMs (including GPT‑4, Claude‑2, Llama‑2, etc.) combined with four agentic frameworks (Auto‑GPT, ReAct, BabyAGI, OpenAI function‑calling), yielding twelve distinct agents. Results reveal fundamental limitations: the best agent achieved only 51.85 % success on build & configuration, 20.56 % on monitoring, 23.87 % on issue resolving, and a mere 13.87 % on test generation. Performance was notably lower on Java and Go than on Python‑centric benchmarks, suggesting that current models are not well‑adapted to compiled languages that require dependency resolution, compilation, and richer static analysis.
Key failure modes identified include: (a) poor high‑level planning—agents often cannot devise correct multi‑step sequences for building or monitoring; (b) incorrect or missing tool usage—agents treat DevOps‑specific CLI tools as out‑of‑distribution, leading to malformed commands; (c) limited ability to parse dynamic runtime information such as logs, metrics, and system states, which hampers monitoring and debugging; and (d) insufficient long‑context reasoning, especially when tasks require maintaining state across several stages.
The authors argue that these shortcomings stem from (1) a lack of DevOps‑specific tool‑use data in pre‑training corpora, (2) insufficient reinforcement‑learning or planning mechanisms that can handle hierarchical, goal‑directed workflows, and (3) the need for multimodal processing of code, logs, and performance metrics. They propose future research directions: incorporating large collections of tool‑call traces into model training, developing agents with persistent memory and hierarchical planning, and designing evaluation protocols that reward efficient, correct tool orchestration.
DevOps‑Gym, along with its open‑source evaluation framework and baseline implementations, is released to the community to catalyze progress toward truly autonomous AI agents capable of managing the entire software development and operations lifecycle.