Comprehensive Evaluation of Large Language Models on Software Engineering Tasks: A Multi-Task Benchmark
Large Language Models (LLMs) have demonstrated remarkable capabilities in software engineering, yet comprehensive benchmarks covering diverse SE activities remain limited. We present a multi-task evaluation of 11 state-of-the-art LLMs across five representative software engineering tasks: bug fixing, feature development, code refactoring, technical copywriting, and research synthesis. Our automated verification framework measures both output quality and completion efficiency. Key findings reveal that (1) models achieving identical perfect scores exhibit 22x variation in completion time, 49x variation in tool efficiency, and 53x variation in estimated cost; (2) tool usage frequency shows no correlation with success (r = 0.077, p = 0.575): one model used 917 tool calls while another solved the same task with 3; (3) two distinct inefficiency patterns emerge, loop inefficiency and inference inefficiency; and (4) coding tasks achieve 100% success while research tasks present greater challenges (90.9%). We release all experimental data, verification scripts, and analysis code for full reproducibility.
💡 Research Summary
This paper presents a comprehensive multi‑task benchmark that evaluates eleven state‑of‑the‑art large language models (LLMs) on five representative software‑engineering activities: bug fixing, feature development, code refactoring, technical copywriting, and research synthesis. The authors built an automated verification framework that runs each model in a controlled agent environment equipped with five tools (read/write/replace files, execute shell commands, and web search for the research task). For each model‑task pair the framework records the final grade (EXCELLENT, PASS, FAIL), total execution time, number of tool invocations, tool diversity, and an estimated monetary cost derived from token usage and model pricing.
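The metrics the framework records per model-task pair can be sketched as a small data structure; field names, the pricing formula, and the example values below are illustrative assumptions, not the paper's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class RunRecord:
    """One model-task run as logged by the verification framework (sketch)."""
    model: str
    task: str
    grade: str                  # "EXCELLENT", "PASS", or "FAIL"
    elapsed_s: float            # wall-clock execution time in seconds
    tool_calls: list[str] = field(default_factory=list)  # tool name per invocation

    @property
    def num_calls(self) -> int:
        return len(self.tool_calls)

    @property
    def tool_diversity(self) -> int:
        # Number of distinct tools actually used in the run.
        return len(set(self.tool_calls))

def estimated_cost(prompt_tokens: int, completion_tokens: int,
                   price_in: float, price_out: float) -> float:
    """Estimated USD cost from token counts and per-million-token prices."""
    return (prompt_tokens * price_in + completion_tokens * price_out) / 1_000_000

run = RunRecord("gpt-5.1", "bug_fix", "EXCELLENT", 18.8,
                ["read_file", "replace_in_file", "execute_shell"])
print(run.num_calls, run.tool_diversity)  # 3 3
```

The grade, time, call count, diversity, and cost fields mirror the five quantities the summary says are recorded for each run.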
The model set spans three provider categories—OpenAI (GPT‑4o, GPT‑5.1, GPT‑5.2), Google (Gemini‑2.5 Flash/Pro, Gemini‑3 Flash/Pro), and open‑weight models (Deepseek‑Chat, GLM‑4.7, Kimi‑K2.5, Qwen3‑VL)—and reflects the current best‑in‑class offerings for code‑related tasks. Each task is carefully crafted to mimic real‑world developer workflows while remaining fully automatable: the bug‑fix task introduces a race condition in a multithreaded inventory script; the feature task requires completing CRUD endpoints in a FastAPI Todo app; the refactor task asks for modularization of an ETL script; the copywriting task demands a markdown blog post covering specific product keywords; the research task asks for a 200‑plus‑word report on solid‑state batteries with citations obtained via web search.
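The bug-fix task's race condition can be illustrated with a toy version of such an inventory script (the class and the lock-based fix below are assumptions; the paper's actual script is not shown in the summary):

```python
import threading

class Inventory:
    """Toy multithreaded inventory, in the spirit of the bug-fix task."""
    def __init__(self, stock: int):
        self.stock = stock
        self._lock = threading.Lock()  # the fix: serialize the read-modify-write

    def buy_unsafe(self):
        # Race: two threads can both observe stock > 0, then both decrement,
        # driving stock below zero under contention.
        if self.stock > 0:
            self.stock -= 1

    def buy_safe(self):
        # Holding the lock makes the check-and-decrement atomic.
        with self._lock:
            if self.stock > 0:
                self.stock -= 1

inv = Inventory(stock=100)
threads = [threading.Thread(target=inv.buy_safe) for _ in range(200)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(inv.stock)  # 0 — never negative with the lock held
```

With `buy_unsafe` the final stock can intermittently go negative, which is exactly the kind of defect an automated verifier can detect by stress-running the script.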
Results show that overall success rates are extremely high (98.2 % across all models), with only Kimi‑K2.5 falling below 100 %. Four models (GPT‑5.1, Gemini‑3 Pro, Deepseek‑Chat, GLM‑4.7) achieve perfect scores (10/10), yet efficiency across the full model set varies dramatically: completion time ranges from 33 seconds (GPT‑4o) to 732 seconds (Qwen3‑VL), a 22‑fold difference; average tool calls span 3.8 to 188, a 49‑fold difference; estimated cost varies by a factor of 53. The authors introduce two novel efficiency ratios—Tool Efficiency Ratio (TER) and Time Efficiency Ratio (TER‑time)—to capture these disparities.
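The summary names TER and TER-time but does not give their formulas; one plausible reading, sketched here purely as an assumption, normalizes each model's resource use by the most efficient run observed on the same task:

```python
def efficiency_ratio(value: float, best: float) -> float:
    """Ratio of a model's resource use (tool calls or seconds) to the best
    observed on the same task. 1.0 = matches the most efficient run;
    higher values indicate proportionally more waste. (Assumed definition.)"""
    return value / best

# Illustrative tool-call counts from the summary's bug-fix example.
tool_calls = {"gpt-5.1": 3, "gemini-3-flash": 917}
best = min(tool_calls.values())
ter = {model: efficiency_ratio(calls, best) for model, calls in tool_calls.items()}
print(ter)
```

Under this reading, the 917-call run scores a TER of roughly 306 relative to the 3-call run, matching the scale of disparity the paper reports.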
A key statistical finding is that tool usage frequency does not predict success: Pearson’s r = 0.077 (p = 0.575). For example, GPT‑5.1 solves the bug‑fix in 18.8 seconds with three tool calls, while Gemini‑3 Flash spends 625 seconds and makes 917 calls on the same problem, yet both receive EXCELLENT grades.
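The correlation statistic cited here is the sample Pearson coefficient, which can be computed without external libraries; the data below are generic examples, not the paper's measurements:

```python
from math import sqrt

def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Sample Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(pearson_r([1, 2, 3, 4], [4, 3, 2, 1]))  # -1.0 (perfect inverse relation)
```

Applied to per-run (tool calls, score) pairs, a value near zero like the reported r = 0.077 means call count carries essentially no information about whether the run succeeds.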
The paper identifies two distinct inefficiency patterns. “Loop inefficiency” occurs when an agent repeats a failing tool sequence without recognizing the error, inflating both time and call count. “Inference inefficiency” describes slow token generation even when the final answer is correct, leading to long runtimes despite low tool usage. These patterns are most pronounced in the research synthesis task, which also has the lowest success rate (90.9 %).
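Loop inefficiency as described here is detectable from the tool-call trace alone; a minimal sketch (the trace format and threshold are assumptions, not the paper's mechanism):

```python
from collections import Counter

def detect_loops(calls: list[tuple[str, str]], threshold: int = 3) -> list[tuple[str, str]]:
    """Flag loop inefficiency: the same (tool, argument) invocation repeated
    more than `threshold` times within a single run, suggesting the agent
    is retrying a failing step without adapting."""
    counts = Counter(calls)
    return [call for call, n in counts.items() if n > threshold]

trace = [("execute_shell", "pytest"), ("replace_in_file", "app.py")] \
        + [("execute_shell", "pytest")] * 5
print(detect_loops(trace))  # [('execute_shell', 'pytest')]
```

Inference inefficiency, by contrast, would show up as high `elapsed_s` with a short trace, so the two patterns are separable in the recorded metrics.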
Task‑level analysis reveals that pure coding activities (bug fixing, feature implementation, refactoring) achieve 100 % success across all models, confirming that modern LLMs have mastered functional correctness for typical code‑level problems. Technical copywriting shows more variance, mainly due to missing required keywords (e.g., “K8s”). Research synthesis is the most challenging, with frequent missing citations and occasional execution errors during web search.
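The copywriting check for required keywords can be automated with a simple substring scan; "K8s" comes from the summary, but the other keywords and the matching rule here are illustrative assumptions:

```python
REQUIRED_KEYWORDS = ["Kubernetes", "K8s", "CI/CD"]  # illustrative keyword list

def missing_keywords(post: str, required: list[str] = REQUIRED_KEYWORDS) -> list[str]:
    """Return the required keywords that do not appear verbatim in the post."""
    return [kw for kw in required if kw not in post]

draft = "Deploying to Kubernetes with a CI/CD pipeline is straightforward."
print(missing_keywords(draft))  # ['K8s']
```

A literal check like this explains the observed failure mode: a model that writes "Kubernetes" everywhere but never the abbreviation "K8s" loses points despite an otherwise fluent post.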
Provider‑wise, OpenAI models dominate the speed‑quality trade‑off, averaging 33 seconds per task and 9.33 points out of 10, while Deepseek‑Chat and GLM‑4.7, though accurate, are slower and costlier. Qwen3‑VL exhibits the worst efficiency profile.
The authors conclude that practitioners should evaluate LLMs not only on accuracy but also on time, tool efficiency, and cost, especially when integrating models into automated development pipelines. They recommend adding failure‑detection mechanisms to prevent loop inefficiency and optimizing token generation to mitigate inference inefficiency. All datasets, verification scripts, and analysis code are released publicly to enable reproducibility and future extensions.