SWE-AGI: Benchmarking Specification-Driven Software Construction with MoonBit in the Era of Autonomous Agents
Although large language models (LLMs) have demonstrated impressive coding capabilities, their ability to autonomously build production-scale software from explicit specifications remains an open question. We introduce SWE-AGI, an open-source benchmark for evaluating end-to-end, specification-driven construction of software systems written in MoonBit. SWE-AGI tasks require LLM-based agents to implement parsers, interpreters, binary decoders, and SAT solvers strictly from authoritative standards and RFCs under a fixed API scaffold. Each task involves implementing 1,000–10,000 lines of core logic, corresponding to weeks or months of engineering effort for an experienced human developer. By leveraging the nascent MoonBit ecosystem, SWE-AGI minimizes data leakage, forcing agents to rely on long-horizon architectural reasoning rather than code retrieval. Across frontier models, gpt-5.3-codex achieves the best overall performance (solving 19/22 tasks, 86.4%), outperforming claude-opus-4.6 (15/22, 68.2%), and kimi-2.5 exhibits the strongest performance among open-source models. Performance degrades sharply with increasing task difficulty, particularly on hard, specification-intensive systems. Behavioral analysis further reveals that as codebases scale, code reading, rather than writing, becomes the dominant bottleneck in AI-assisted development. Overall, while specification-driven autonomous software engineering is increasingly viable, substantial challenges remain before it can reliably support production-scale development.
💡 Research Summary
The paper introduces SWE‑AGI, an open‑source benchmark designed to evaluate large language model (LLM) agents that autonomously construct production‑scale software systems from explicit specifications. All tasks are written in MoonBit, a newly emerging language with a minimal presence in public code repositories, which deliberately reduces the chance of data leakage and forces agents to rely on genuine specification comprehension and long‑horizon architectural reasoning rather than simple code retrieval.
SWE‑AGI comprises 22 tasks spanning seven domains: template and domain‑specific languages, data serialization formats, markup/document formats, programming language front‑ends, binary formats and streaming decoders, networking protocol state machines, and automated reasoning/SAT solving. Each task requires implementing 1,000–10,000 lines of core logic (excluding tests), corresponding to weeks or months of work for an experienced developer. The tasks are organized into three difficulty tiers (6 easy, 8 medium, 8 hard) based on code volume and semantic complexity such as multi‑phase parsing, large state machines, and strict error‑recovery requirements.
A task is delivered as a starter repository containing: (i) a TASK.md file that states the goal, constraints, and acceptance criteria; (ii) a specs/ directory with authoritative RFCs, standards, or other reference documents; (iii) a declaration‑first API scaffold (using MoonBit’s declare keyword) that fixes the public interface; and (iv) a public test suite for rapid local iteration plus a hidden private test suite used only for final evaluation. Agents may use any tools (including web search) but must produce a final submission that compiles and passes all hidden tests; no intermediate human intervention is allowed.
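As a concrete illustration, a starter repository following the four-component structure above might look like the sketch below. All file and task names here are hypothetical; only the components themselves (TASK.md, specs/, the declaration-first scaffold, and the public test suite) come from the paper.

```
json-parser/              # hypothetical task: a spec-driven JSON parser
├── TASK.md               # goal, constraints, and acceptance criteria
├── specs/
│   └── rfc8259.txt       # authoritative reference document for the task
├── src/
│   └── parser.mbt        # declaration-first scaffold fixing the public API
│                         # (declared signatures; bodies left to the agent)
└── test/
    └── public_test.mbt   # public suite for rapid local iteration
                          # (the private suite is withheld until final evaluation)
```

Because the scaffold fixes the public interface up front, the hidden tests can exercise the submission without any per-agent adaptation.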
The evaluation protocol measures several metrics: task success rate (fraction of tasks where the final submission passes all hidden tests), overall test‑suite pass rate, wall‑clock time to first successful submission, and core implementation size (LOC). Optional behavioral statistics (e.g., proportion of time spent reading specifications, reading code, writing code, debugging, testing, planning, or external search) are also collected from tool‑usage logs.
Performance results across a range of frontier and open‑source models are reported. The proprietary model gpt‑5.3‑codex achieves the highest overall performance, solving 19 out of 22 tasks (86.4%). It outperforms its predecessor gpt‑5.2‑codex (17/22, 77.3%) and Claude‑opus‑4.6 (15/22, 68.2%). These frontier models solve every easy‑tier task, but success rates drop sharply on the medium and especially the hard tier: on hard tasks, gpt‑5.3‑codex solves only 2 out of 8, indicating a steep performance cliff as task complexity grows. Among open‑source models, Kimi‑2.5 performs best, solving 2 of the 6 easy‑tier tasks, while the others (Gemini‑3‑flash, DeepSeek‑v3.2, GLM‑4.7, Qwen‑3‑max, Claude‑sonnet‑4.5) solve at most one.
A detailed behavioral analysis reveals that, as codebases scale, code reading becomes the dominant bottleneck. In logged actions, reading specifications and existing code consumes more than half of the total time, while actual code writing occupies a smaller fraction. The newer gpt‑5.3‑codex exhibits a more iteration‑oriented profile: higher debugging share, fewer redundant actions, and faster time‑to‑solution compared with gpt‑5.2‑codex, which spends more time on code comprehension. This suggests that improving agents’ ability to efficiently parse and internalize large specifications will be crucial for future progress.
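The behavioral statistics described above reduce to a simple aggregation over tool-usage logs. The sketch below (Python for illustration; the log format and category names are assumptions, not the paper's schema) turns a sequence of (action category, seconds) entries into time shares, from which a "reading-dominant" profile can be read off directly:

```python
from collections import defaultdict

def action_time_shares(log: list[tuple[str, float]]) -> dict[str, float]:
    """Aggregate (category, seconds) log entries into fractional time shares."""
    totals: dict[str, float] = defaultdict(float)
    for category, seconds in log:
        totals[category] += seconds
    grand_total = sum(totals.values())
    return {cat: secs / grand_total for cat, secs in totals.items()}

# Hypothetical session: reading (spec + code) dominates writing and debugging.
shares = action_time_shares([
    ("read_spec", 120.0),
    ("read_code", 180.0),
    ("write_code", 60.0),
    ("debug", 40.0),
])
```

Under this toy log, reading specifications and code together accounts for 75% of the session, mirroring the paper's finding that comprehension, not code production, dominates as the codebase grows.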
The authors justify the choice of MoonBit on two grounds: (1) its nascent ecosystem minimizes the risk that pre‑training data already contains near‑identical implementations, thereby enforcing genuine reasoning; (2) MoonBit’s type‑soundness, unified build/test toolchain, and declaration‑first workflow naturally align with a specification‑driven development process.
In conclusion, SWE‑AGI demonstrates that autonomous, specification‑driven software engineering is becoming feasible for modest‑scale tasks, but substantial challenges remain for production‑scale, hard‑tier systems. Key open problems include long‑term memory management, modular architectural planning, robust error handling, and efficient specification‑to‑code mapping. SWE‑AGI provides a rigorous, retrieval‑resistant benchmark that can guide future research on LLM‑based software agents, encouraging advances that move beyond pattern matching toward true engineering competence.